High Performance Computing Center North
There is a problem with the /pfs/nobackup file system as mounted on the login and compute nodes; all accesses to it will hang.
If your .bashrc or other login scripts touch that file system, your login attempt will likely appear to hang as well.
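As a workaround, login scripts can probe the file system with a timeout before using it. This is a hypothetical sketch (the `SCRATCH` variable and the 2-second limit are examples, not site policy); `timeout` is part of GNU coreutils:

```shell
# Hypothetical ~/.bashrc guard: only touch /pfs/nobackup if the mount
# responds within 2 seconds, so a hung file system cannot block login.
if timeout 2 stat /pfs/nobackup >/dev/null 2>&1; then
    export SCRATCH="/pfs/nobackup/$USER"
fi
```

If the mount is unresponsive, the `stat` call is killed after 2 seconds, the `if` branch is skipped, and the shell starts normally without `SCRATCH` set.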
*UPDATE 2015-02-28 11:15*
The problem has been reported to the vendor and we are waiting for them to get back to us.
The batch queues have been stopped so no more jobs will be affected.
*UPDATE 2015-03-01 12:45*
The problem has been fixed, so jobs and PFS should once again be operational.
The login node of Akka is experiencing severe disk problems.
The disk will be replaced first thing on Monday (23/2) morning and the node reinstalled.
Until then it will be kept offline.
This does not in any way affect running or queued jobs.
Any data can of course be picked up by logging in to Abisko.
*UPDATE 2015-02-23 07:00*
The login node of Akka is now back online
Sun, 2015-02-22 17:27 | Åke Sandgren
Due to an unexpected internal change in the last kernel update, both AFS (serving your normal home directory) and Lustre (serving /pfs/nobackup) stopped working.
We have therefore blocked all queues to make sure no more nodes reboot into the problematic kernel.
The problem does not affect running jobs.
We will remove this kernel and resume operation as soon as possible.
*UPDATE 2015-02-04 11:35 CET*
Both systems are now back in normal production.
The /pfs/nobackup file system is currently suffering from a misconfiguration.
The file system was created with fewer inodes than intended and we are in short supply at the moment.
This causes creation of files or directories to fail once we run out of inodes, which in turn may cause jobs to fail.
At the time of writing (2015-01-12 16:18 CET) we have ~4 million inodes available so the problem is not immediate but depending on what jobs are currently in the queue this could change quickly.
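If you want to check the current inode situation yourself, the standard `df -i` command reports inode usage for a mount point (the path below is the one named above; the `awk` formatting is just an example):

```shell
# Print the number of free inodes on /pfs/nobackup.
# Line 2 of `df -i` output holds the numbers; field 4 is "IFree".
df -i /pfs/nobackup | awk 'NR==2 {print $4 " inodes free"}'
```

Jobs that create very many small files consume inodes quickly, so cleaning up or archiving such file trees helps conserve them.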
The batch master server for Abisko needs hardware maintenance and the /pfs/nobackup file system needs a reconfiguration.
We will begin this on Wednesday Jan 21st, starting at 08:00 CET. Unfortunately the /pfs/nobackup maintenance took longer than expected, so we are still working on this on Thursday Jan 22.
A reservation has been put in place on the batch systems, which means that jobs will not be started unless they are short enough to finish before the maintenance begins.
The login nodes of both clusters will also be disabled.
During the Christmas holidays we are running with reduced staffing.
We will try to solve any arising problems as quickly as possible, but there will be delays due to this.
We will be back at full capacity on Jan 7th.
HPC2N staff wish you all a Merry Christmas and a Happy New Year.
We had some RAID problems on our batch server, meaning (among other things) that no jobs could be submitted.
It was necessary to reboot the server. Everything seems to be working correctly again.
Fri, 2014-11-14 12:03 | Birgitte Brydsö