system news

Cluster queues suspended, 2019-12-06, Solved 2019-12-07 14:30

Posted on: 7 December 2019
By: zao

The batch queues on the clusters are suspended while we perform some troubleshooting. Jobs already running should continue to run but no new jobs will start until this concludes.

Queues back online 2019-12-07 14:30

Maintenance on cooling system affects Kebnekaise and Abisko, 2019-10-16 - 17

Posted on: 27 September 2019
By: ake

There will be a two-day maintenance on the cooling system and power feed 2019-10-16 - 17 that affects both Kebnekaise and Abisko.

The maintenance window starts at 2019-10-16 04:00 and ends 2019-10-17 18:00

The clusters will be down during that period and no jobs will be running.

During the days leading up to the maintenance only jobs with a short enough runtime to finish before the maintenance starts will be allowed to run.

So, if you have jobs that can use a shorter runtime it is advisable to submit them during the week(s) before the maintenance window.
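
As a rough illustration of the arithmetic, the sketch below computes the longest runtime a job could request and still finish before the maintenance window opens at 2019-10-16 04:00. The function names, the ten-minute safety margin, and the `D-HH:MM:SS` output format are assumptions made for this example; check the batch system documentation for the time-limit format it actually accepts.

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from the announcement above.
MAINTENANCE_START = datetime(2019, 10, 16, 4, 0)


def max_runtime_before(now, start=MAINTENANCE_START, margin=timedelta(minutes=10)):
    """Return the longest runtime that still ends before `start`.

    `margin` is a hypothetical safety buffer, not something the
    announcement specifies. Returns a zero timedelta if nothing fits.
    """
    remaining = start - now - margin
    return max(remaining, timedelta(0))


def format_time_limit(limit):
    """Format a timedelta as D-HH:MM:SS (an assumed time-limit format)."""
    total = int(limit.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    submit_time = datetime(2019, 10, 14, 12, 0)
    print(format_time_limit(max_runtime_before(submit_time)))
```

A job submitted at the example time would also have to wait in the queue, so the usable runtime is in practice shorter than the value printed here.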

AFS home directories inaccessible on Kebnekaise (resolved)

Posted on: 5 August 2019
By: zao

The home directories on Kebnekaise have been inaccessible since this weekend. We're aware of the problem and are looking into the cause.
For urgent access, please use the Abisko login node in the meantime. The AFS and PFS file systems are accessible there even if you only have an allocation on Kebnekaise.

[Update 2019-08-05 09:45 CEST]
We've rebooted the login node to restart the affected services and access to AFS home directories is now restored.

Upgrade of Lustre servers to solve the last weeks' problems 2019-07-(01-05) (clusters now UP again)

Posted on: 27 June 2019
By: torkel

Over the last two weeks we have had serious problems with PFS, the parallel file system. The cause of the problems was identified fairly quickly. However, all attempts to get a temporary fix in place over the summer have failed.

We have therefore, in consultation with the vendor of the storage solution, decided to update the server software starting the morning of July 1. The update was originally planned to take place in the early autumn and contains a permanent fix to the problems we have seen.

The update is expected to take the whole week.

Continued severe problems with pfs (2019-06-25)

Posted on: 25 June 2019
By: bbrydsoe

The pfs (parallel file system) is still experiencing severe problems. The bug fix we implemented seemed to stabilize it for about a week, but now pfs is down again.

Both Kebnekaise and Abisko are affected, including access to the PFS filesystem from the login nodes. We recommend that you try to avoid using the PFS filesystem, since it either takes very long to access or it cannot be accessed at all.

We are currently working intensively to solve the problems. At the moment we have no ETA when the problems will be resolved.

Severe problems with the PFS filesystem (updated 2019-06-24 11:35)

Posted on: 14 June 2019
By: torkel

We are currently experiencing severe problems with the pfs (parallel file system).

Both Kebnekaise and Abisko are affected, including access to the PFS filesystem from the login nodes. Either it takes a very long time to access the files or the files are not available. We recommend that you try to avoid using the PFS filesystem.

We are working intensively to solve the problems. At the moment we have no ETA for when the problems will be resolved.

Note: The batch queues for Kebnekaise and Abisko are stopped until further notice.

pfs file system slow/misbehaving (affects both Kebnekaise and Abisko)

Posted on: 13 June 2019
By: bbrydsoe

2019-06-13, 15:40

We are currently experiencing a slowdown of the pfs (parallel file system). This affects all access on both Kebnekaise and Abisko.

The queues have been stopped until further notice.

We are investigating and will update this news as soon as we have any information.

Abisko: Rack 2 lost power/down

Posted on: 10 May 2019
By: bbrydsoe

2019-05-10, 17:30

Rack two of Abisko went down due to an electrical fault within the rack and all jobs running on it were lost. We are looking into the electrical problem.

This news will be updated when there is more information.

RESOLVED pfs file system slow/down, 2019-04-04

Posted on: 4 April 2019
By: nikke

2019-04-04:

We are experiencing severe slowdown on the /pfs/nobackup file system, affecting all accesses including running jobs.

This is caused by components in the storage system restarting for unknown reasons; investigation is ongoing.

*UPDATE* In order to identify what is going on we are forced to shut down the file system occasionally. The vendor is assisting in identifying and fixing the issue.

Unplanned cluster issues due to switchboard failure

Posted on: 15 March 2019
By: zao

2019-03-29:

Normal power routing restored to all nodes.

2019-03-28:

Repairs completed, switchboard powered up.

2019-03-25:

Due to component delivery delays the final steps of repair are postponed. The new date for completion is Wednesday 2019-03-27.

2019-03-20:

Replacement parts and cables are en route. We currently estimate installation and recertification of the switchboard to be finished by the end of Monday 2019-03-25.

2019-03-15: