High Performance Computing Center North
On Monday, March 6, between 20:00 and midnight, various core network equipment will be restarted. No jobs will be allowed to start during this maintenance window.
During a routine upgrade of the batch system SLURM on Kebnekaise, intended to increase the amount of usable memory on the large memory nodes, we encountered an unexpected malfunction.
This discarded all running and scheduled jobs.
We're working on restoring functionality and will update this system news entry.
Update 2017-02-24 15:42 CET:
The new version of SLURM is misbehaving badly, and the commands to interact with it are very slow or do not respond at all. For all practical purposes, you may consider the Kebnekaise cluster down for now.
We have resolved a long-standing problem with Emacs on the Kebnekaise login node. If you had any form of X forwarding enabled, invoking emacs resulted in a segmentation fault, and the workaround was to start it with the argument -nw.
This is no longer necessary, as we have patched the offending library. You now get a properly working graphical Emacs if you have X forwarding enabled. If you wish to run a text-mode Emacs in such an environment, the argument -nw still exists and works as intended.
We have been having some network problems on Abisko since late Sunday evening (2017-02-05). These are mostly solved, but there are still some remaining issues that we are investigating. There is a risk that this may affect a small number of running jobs.
In addition, a number of nodes have temporarily been taken out of production. With fewer nodes available, queue times on Abisko may increase.
This system news entry will be updated when more information is available.
Tonight (2017-02-06) between 20:00 and 24:00, the central network group will perform maintenance on the network routers.
We have therefore put a reservation on all nodes during that time window to make sure no jobs are started.
This affects both Abisko and Kebnekaise.
After a kernel update, the Kebnekaise nodes failed to detect the Lustre filesystem. As a result, no jobs are running.
We are looking into the problem. Updates will follow in this news entry.
UPDATE: The problem should now be solved. The nodes are back in production and jobs are being scheduled again.
t-mn02 was rebooted after updates were installed. During the approximately 10 minutes the reboot took, no new jobs could be submitted to Abisko.
Jobs already in the queue are not affected. This does not affect Kebnekaise.