Main HPC2N router rebooting 2024-01-03 14:30
Our main router to HPC2N will be rebooted on 2024-01-03 at 13:30, which will cause some network interruptions.
No new jobs will start while this is being done.
It is expected to be back online at ~15:45.
Due to an unexpected cooling failure around 00:30 on 2023-12-04, Kebnekaise is currently down.
We are currently investigating why this happened and will bring Kebnekaise back online once we have established the cause.
*UPDATE 2023-12-04 11:14*
Kebnekaise is now up again.
We are going to upgrade the file system for projects and home directories, /proj/nobackup and /home.
There is a maintenance window for this from 2023-11-15 08:00 - 2023-11-15 17:00.
No jobs will be allowed to run during that period and login nodes will also be disabled.
This is synchronized with electrical and cooling maintenance work on 2023-11-16 to minimize total downtime.
*** UPDATE 2023-11-17 11:25 ***
The maintenance is now finished and the batch system and login nodes are back online.
We will have a maintenance window on Kebnekaise 2023-10-12 - 2023-10-13 to upgrade the batch system. It is due to an important security update of SLURM (the workload manager/batch scheduler).
From 2023-10-12 08:00 no jobs will be allowed to run. This means that jobs will not be allowed to start if their requested time limit reaches into this maintenance window.
The maintenance window is from 2023-10-12 08:00 to 2023-10-13 17:00, but we hope to be finished before that.
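As a rough illustration of the rule above, the following sketch checks whether a job with a given requested time limit could still start before the window opens. It is a simplified model only: the function names are hypothetical, the handling of a job ending exactly at 08:00 is an assumption, and the batch system makes the real decision.

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from the announcement.
WINDOW_START = datetime(2023, 10, 12, 8, 0)


def can_start_before_window(now, time_limit, window_start=WINDOW_START):
    """Return True if a job started at `now` would reach its requested
    time limit no later than the start of the maintenance window.

    Assumption: a job ending exactly at the window start is allowed.
    """
    return now + time_limit <= window_start


def max_time_limit(now, window_start=WINDOW_START):
    """Longest time limit that still fits before the window (never negative)."""
    return max(window_start - now, timedelta(0))


if __name__ == "__main__":
    now = datetime(2023, 10, 11, 20, 0)
    print(can_start_before_window(now, timedelta(hours=12)))
    print(can_start_before_window(now, timedelta(hours=13)))
    print(max_time_limit(now))
```

In practice this means that requesting a shorter time limit can let a job start before the window instead of waiting until the maintenance is over.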
There are three new nodes available on Kebnekaise.
Two of the new nodes have dual NVIDIA A100 GPUs and one is a many-core CPU node.
There are some notable differences with these nodes.
2023-06-09 A mishap with Slurm caused a loss of the job accounting data for Kebnekaise jobs today between 00:00 and 16:40.
We can see no other effect on running jobs, and the job queue is now open again after having been DOWN for 1 hour.
If you see any other negative effects, send us a support case and we'll help solve the issue.
Sorry for the inconvenience that this may have caused.
Best regards,
/Support
We're currently moving the password handling to another host, and during the move changing or resetting passwords is not possible.
The service will be back up later today.
We are currently experiencing file system server problems.
This is blocking logins and is also affecting running jobs.
We're working to get it back online but currently have no ETA for this.
UPDATE 09:40
The problem with the file server has been solved and everything should work as normal now.
Akademiska hus has planned maintenance of the cooling systems for the HPC2N Infrastructure computer hall on 2023-02-01.
We'll coordinate an upgrade of the central file system around their maintenance to minimize the time the cluster is draining jobs.
The combined maintenance window will therefore start on 2023-01-30 07:00 and, according to our planning, end on 2023-02-03 16:00.
All Kebnekaise nodes, central storage and the login nodes will be unavailable during this time.
We are currently experiencing file system server problems.
This is blocking logins and is also affecting running jobs.
We're working to get it back online but currently have no ETA for this.
UPDATE 13:00
The problem with the file server has been solved and everything should work as normal now.