system news

The /pfs/nobackup problems from 2015-05-27 are now solved

  • Posted on: 28 April 2016
  • By: admin

The problems we had with the /pfs/nobackup filesystem are now solved.

We are still waiting for the vendor to tell us exactly what happened and how to make sure it doesn't happen again.

But for the time being things are expected to be back to normal.

Some jobs may have failed due to the timeouts caused by the problem, but as far as we can tell at the moment no files have been lost.

We apologize for this and will try to minimize the risk of something like this happening again.

/pfs/nobackup back online

  • Posted on: 28 April 2016
  • By: admin

The /pfs/nobackup file system is now back online.

Some files and directories may have been lost.
We have sent an email to the users we know are affected.

Please let us know if you find missing files or directories.

There is, however, no way to retrieve any lost data.

We are sorry about this, and we are working with the vendor to reduce the risk of it happening again.

pfs read-only

  • Posted on: 28 April 2016
  • By: admin

There is currently (2015-03-23 15:11) an outage of the PFS filesystem, rendering it read-only on all HPC2N resources.

Batch queues have been suspended until the problem is resolved.

Update 2015-03-23 16:13:
The problem seems complex and it is unlikely that it will be resolved today.

Update 2015-03-24 18:13:
We believe the problem has been resolved. PFS is again accessible and batch queues have been resumed.

*Update 2015-03-24 19:30*
The file system went back to being mounted read-only. The problem is still there...

/pfs/nobackup not accessible

  • Posted on: 28 April 2016
  • By: admin

There is a problem with the /pfs/nobackup file system as mounted on the login and compute nodes; all accesses to it will hang.

If your .bashrc or other login scripts touch that filesystem, your login attempt will likely appear to hang.
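
One way to keep a hung mount from freezing a login session is to guard any filesystem access in login scripts with a timeout. The sketch below is only illustrative and not part of our setup; the `fs_ready` helper name and the 2-second limit are assumptions:

```shell
#!/bin/sh
# Hypothetical guard for a login script: only touch a filesystem if it
# answers within a short timeout, so a hung mount cannot hang the login.
fs_ready() {
    # timeout(1) kills the probe if the mount does not respond in time.
    timeout 2 ls "$1" >/dev/null 2>&1
}

# Example use in ~/.bashrc (the /pfs/nobackup path is from this post):
if fs_ready /pfs/nobackup; then
    export SCRATCH="/pfs/nobackup/$USER"
fi
```

With a guard like this, a hung /pfs/nobackup mount only costs a short delay at login instead of an apparently dead session.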

*UPDATE 2015-02-28 11:15*
The problem has been reported to the vendor and we are waiting for them to get back to us.
Batch queues have been stopped so no more jobs will be affected.

*UPDATE 2015-03-01 12:45*
The problem has been fixed, so jobs and PFS should once again be operational.

Akka login node is having disk problems

  • Posted on: 28 April 2016
  • By: admin

The login node of Akka is experiencing severe disk problems.

The disk will be replaced first thing on Monday (23/2) morning and the node reinstalled.

Until then it will be kept offline.

This does not in any way affect running or queued jobs.

Any data can of course be picked up by logging in to Abisko.

*UPDATE 2015-02-23 07:00*
The login node of Akka is now back online.

Sun, 2015-02-22 17:27 | Åke Sandgren

Batch queues stopped on Abisko and Akka

  • Posted on: 28 April 2016
  • By: admin

Due to an unexpected internal change in the last kernel update, both AFS (serving your normal home directory) and Lustre (serving /pfs/nobackup) stopped working.

We have therefore blocked all queues to make sure no more nodes reboot into the problematic kernel.

The problem does not affect running jobs.

We will remove this kernel and resume operation as soon as possible...

*UPDATE 2015-02-04 11:35 CET*
Both systems are now back in normal production.

Problem with /pfs/nobackup file system

  • Posted on: 28 April 2016
  • By: admin

The /pfs/nobackup file system is currently suffering from a misconfiguration.

The file system was created with fewer inodes than intended, and inodes are in short supply at the moment.

This causes the creation of files or directories to fail when we run out of inodes, which in turn may cause jobs to fail.

At the time of writing (2015-01-12 16:18 CET) we have ~4 million inodes available so the problem is not immediate but depending on what jobs are currently in the queue this could change quickly.
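
You can watch inode consumption yourself with `df -i`, which reports inode counts per filesystem; the path below is the one from this post, so substitute any mounted path as needed:

```shell
# Show inode usage for the scratch filesystem; an IUse% close to 100%
# means creation of new files or directories will start to fail.
df -i /pfs/nobackup
```

If your own job directories hold very large numbers of small files, packing them into archives (e.g. with tar) frees inodes for everyone.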


Updated: 2017-12-15, 10:08