Batch queue malfunction on Kebnekaise

  • Posted on: 24 February 2017
  • By: zao

During a routine upgrade of the SLURM batch system on Kebnekaise, intended to increase the amount of usable memory on the large memory nodes, we encountered an unexpected malfunction.

The malfunction discarded all running and scheduled jobs.

We're working on restoring functionality and will update this system news entry.

Update 2017-02-24 15:42 CET:
The new version of SLURM is misbehaving badly, and the commands used to interact with it are very slow or do not respond at all. For now, you may consider the Kebnekaise cluster down for all practical purposes.

Update 2017-02-24 23:16 CET:
We have reverted most of the batch system components to the previous version and the batch queue should be up again. Please report any strange behaviour you notice in the batch system.

We're extremely sorry about the jobs lost in this incident. The intent of this upgrade was to improve the capabilities of the cluster without affecting running jobs.

Updated: 2017-03-22, 15:55