system news

Power failure in Umeå, all clusters at HPC2N down. 20181112 14:10 (*Systems online again*)

  • Posted on: 12 November 2018
  • By: ake

We just suffered an almost city-wide power failure.

All clusters are down, and all jobs that were running have been lost.

We will bring things up as soon as we can when power comes back.

 

*UPDATE 20181112 17:40*

The power is back and the systems are back online.

Power failure at HPC2N, both clusters down, 20181019 19:15, *SOLVED 20181022 13:00*

  • Posted on: 19 October 2018
  • By: ake

We have suffered a power failure at HPC2N due to a cooling failure.

The power failure covers both Kebnekaise and Abisko, so all jobs that were running have failed and no new jobs can start.

This may also cause problems logging in to Kebnekaise if your login shell tries to access the /pfs/nobackup file system or load modules.

We have stopped all queues and will assess the situation as soon as we can.

 

*UPDATE 20181022 09:00*

The problem with the cooling system should be fixed according to Akademiska Hus.

GPU outage on Kebnekaise (resolved)

  • Posted on: 25 September 2018
  • By: zao

The GPU nodes of Kebnekaise have been temporarily unavailable since the evening of Monday 2018-09-24. Jobs that started after that may have failed with messages about mismatched driver versions and may report missing GPUs.

We are addressing the problem. While work progresses, nodes will be unavailable and the queue may report that jobs are blocked due to resources. We will update this news entry when this is resolved.

Kebnekaise running jobs failed 2018-09-05 18:44

  • Posted on: 6 September 2018
  • By: brorerik

Kebnekaise cluster nodes had unexpected problems during the
maintenance of an internal service related to them.

This caused disk access timeouts, which in turn terminated
jobs running on the nodes.

The problem has been solved and the cluster is working
normally again.

We are sorry about the problems this has caused.

Maintenance work is complete

  • Posted on: 9 June 2018
  • By: torkel

The upgrade of the Lustre parallel file system (/pfs/nobackup), as well as the rest of the maintenance we have done during the week, is now complete. All systems are up and running again, including the batch queues on both Kebnekaise and Abisko. The login nodes are open for access again.

During the week we have, among other things:

Maintenance affecting all HPC2N systems between 2018-06-04 and 2018-06-08

  • Posted on: 22 May 2018
  • By: ake

We have scheduled a major maintenance window to upgrade the Lustre parallel file system (/pfs/nobackup).

One of the goals is to introduce new functionality, for instance project-based storage.

We will also make a number of related changes and optimizations to the Lustre file system setup.

 

We are planning for a full week of complete downtime on all HPC2N systems, including login and ThinLinc nodes, starting 2018-06-04 07:30 CEST.

Please make sure to copy any files you want to work with during the downtime to some other system well in advance of the maintenance.

Updated: 2018-12-12, 14:43