system news

Scheduled electrical maintenance, Wed Aug 19 09:00-14:00

  • Posted on: 28 April 2016
  • By: admin

On Wednesday, August 19, 09:00-14:00, all compute clusters will be unavailable due to scheduled maintenance of the high-voltage electrical infrastructure.

Jobs that would extend into the downtime window will not be started. If possible, consider submitting shorter jobs so that the systems can be used as fully as possible before the downtime.

Tue, 2015-08-11 09:33 | Niklas Edmundsson

PFS down (resolved)

  • Posted on: 28 April 2016
  • By: admin

As is tradition lately, PFS is again inaccessible and attempts to use it will hang.
Batch queues have been suspended and we're poking the storage system.

Update 2015-08-08 21:39 CEST:
The filesystem is back online and the queues have been resumed.

Fri, 2015-08-07 15:05 | Lars Viklund

PFS inaccessible

  • Posted on: 28 April 2016
  • By: admin

We are again having problems with the PFS file system. File access is slow or hangs. We are investigating, and more information will follow as soon as we have it.

The queues have been paused and will be resumed when access to PFS has been restored.

Update 12:26: The PFS system is back online and all queues have been resumed. We are sorry for any inconvenience the problem caused. If you notice further problems with PFS, please report them to support@hpc2n.umu.se.

Problem with pfs

  • Posted on: 28 April 2016
  • By: admin

We are currently experiencing problems with the PFS file system. File access is slow or hangs. We are looking for the cause, and more information will follow as soon as we have it.

The queues have been paused and will be resumed when access to PFS has been restored.

Update 20:44: The PFS system is back online and all queues have been resumed. We apologize for any inconvenience. If you notice any further problems with the PFS file system, please report them to support@hpc2n.umu.se.

Driver issues with the Abisko interconnect

  • Posted on: 28 April 2016
  • By: admin

We are experiencing driver issues with the InfiniBand interconnect. This might cause jobs to fail with errors such as "Invalid resource type" and "Invalid CQ event".

We are in the process of updating the drivers and hope that this will solve the issues. Since we do not want to abort running jobs, we cannot say exactly when the update will be finished, but we expect it to be done before the end of the week.

Mon, 2015-07-06 09:38 | Roger Oskarsson

The /pfs/nobackup problems from 2015-05-27 are currently solved

  • Posted on: 28 April 2016
  • By: admin

The problems we had with the /pfs/nobackup file system have been resolved for now.

We are still waiting for the vendor to tell us exactly what happened and how to make sure it doesn't happen again.

But for the time being things are expected to be back to normal.

Some jobs may have failed due to the timeouts caused by the problem, but as far as we can tell at the moment, no files have been lost.

We apologize for this and will try to minimize the risk of something like this happening again.

/pfs/nobackup back online

  • Posted on: 28 April 2016
  • By: admin

The /pfs/nobackup file system is now back online.

Some files and directories may have been lost.
We have sent a mail to users we know are affected.

Please let us know if you find missing files or directories.

However, there is no way to recover any lost data.

We are sorry about this, and we are working with the vendor to reduce the risk of it happening again.

PFS read-only

  • Posted on: 28 April 2016
  • By: admin

There is currently (2015-03-23 15:11) an outage of the PFS filesystem, rendering it read-only on all HPC2N resources.

Batch queues have been suspended until the problem is resolved.

Update 2015-03-23 16:13:
The problem seems complex and it is unlikely that it will be resolved today.

Update 2015-03-24 18:13:
We believe the problem has been resolved. PFS is again accessible and batch queues have been resumed.

Update 2015-03-24 19:30:
The file system has again been mounted read-only. The problem has not been resolved.


Updated: 2018-11-12, 18:32