system news

AFS home directories inaccessible on Kebnekaise (resolved)

  • Posted on: 5 August 2019
  • By: zao

The home directories on Kebnekaise have been inaccessible since this weekend. We're aware of the problem and are looking into the cause.
For urgent access, please use the Abisko login node in the meantime. The AFS and PFS file systems are accessible there even if you only have an allocation on Kebnekaise.

[Update 2019-08-05 09:45 CEST]
We've rebooted the login node to restart the affected services and access to AFS home directories is now restored.

Upgrade of Lustre servers to solve the last weeks' problems 2019-07-(01-05) (clusters now UP again)

  • Posted on: 27 June 2019
  • By: torkel

For the last two weeks we have had serious problems with PFS, the parallel file system. The cause of the problems was identified fairly quickly. However, all attempts to get a temporary fix in place over the summer have failed.

We have therefore, in consultation with the vendor of the storage solution, decided to update the server software starting the morning of July 1. The update was originally planned to take place in the early autumn and contains a permanent fix to the problems we have seen.

The update is expected to take the whole week.

Continued severe problems with pfs (2019-06-25)

  • Posted on: 25 June 2019
  • By: bbrydsoe

The pfs (parallel file system) is still experiencing severe problems. The bug fix we implemented seemed to stabilize it for about a week, but pfs is now down again.

Both Kebnekaise and Abisko are affected, including access to the PFS filesystem from the login nodes. We recommend that you try to avoid using the PFS filesystem, since it either takes very long to access or it cannot be accessed at all.

We are currently working intensively to solve the problems. At the moment we have no ETA for when the problems will be resolved.

Severe problems with the PFS filesystem (updated 2019-06-24 11:35)

  • Posted on: 14 June 2019
  • By: torkel

We are currently experiencing severe problems with the pfs (parallel file system).

Both Kebnekaise and Abisko are affected, including access to the PFS filesystem from the login nodes. Either it takes a very long time to access the files or the files are not available. We recommend that you try to avoid using the PFS filesystem. 

We are working intensively to solve the problems. At the moment we have no ETA for when the problems will be resolved.

Note: The batch queues for Kebnekaise and Abisko are stopped until further notice.

*RESOLVED* pfs file system slow/down, 2019-04-04

  • Posted on: 4 April 2019
  • By: nikke

2019-04-04:

We are experiencing severe slowdown on the /pfs/nobackup file system, affecting all accesses including running jobs.

This is caused by components in the storage system restarting for unknown reasons. Investigation is ongoing.

*UPDATE* In order to identify what is going on we are forced to shut down the file system occasionally. The vendor is assisting in identifying and fixing the issue.

Unplanned cluster issues due to switchboard failure

  • Posted on: 15 March 2019
  • By: zao

2019-03-29:

Normal power routing restored to all nodes.

 

2019-03-28:

Repairs completed, switchboard powered up.

 

2019-03-25:

Due to component delivery delays the final steps of repair are postponed. The new date for completion is Wednesday 2019-03-27.

 

2019-03-20:

Replacement parts and cables are en route. We currently estimate installation and recertification of the switchboard to be finished by the end of Monday 2019-03-25.

 

2019-03-15:

2019-03-13 - Power outage at HPC2N. *Clusters back up*

  • Posted on: 13 March 2019
  • By: roger

There was a 25-minute power outage at the Umeå University campus just before 09:30. This brought down our clusters (Kebnekaise and Abisko) and killed all running jobs. It has also affected the Kebnekaise login nodes.

Power is back now, and we are in the process of bringing the clusters back up. We will add more information when we know what happened and/or when the clusters are back up.

*Update* The reason for the power outage was a severed cable that triggered a circuit breaker in one of the university's internal power stations.

Updated: 2019-08-16, 15:23