system news

/pfs/nobackup problems

  • Posted on: 29 April 2016
  • By: admin

We are currently experiencing some problems with the /pfs/nobackup filesystem. Investigations are ongoing.

All the batch queues have been stopped until further notice.

UPDATE 13:04: Everything should be back to normal again. It seems it was only a local problem on the Abisko login node. Jobs running through the batch system should not have been affected.

/pfs/nobackup unavailable

  • Posted on: 29 April 2016
  • By: admin

We are experiencing some problems with the /pfs/nobackup filesystem. Investigations are ongoing. At the moment we have no ETA for when /pfs/nobackup will be up again.

All the batch queues have been stopped until further notice.

*UPDATE 2015-12-23 17:10*
The file system is now back online and batch queues have been enabled again.

/pfs/nobackup currently not responding again, ALL batch queues stopped (FIXED AGAIN)

  • Posted on: 28 April 2016
  • By: admin

The /pfs/nobackup filesystem stopped responding again.

We're working on getting it back online.

*UPDATE 2015-10-16 10:05*
The file system is now back online again and the queues are running.

*UPDATE 2015-10-16 12:35*
The file system is non-responsive again. We're trying to get it back online as soon as possible.

*UPDATE 2015-10-16 20:00*
The file system is now back online and the problem has been identified.

Fri, 2015-10-16 08:28 | Åke Sandgren

Power maintenance 2015-09-18 06:30, batch nodes will be drained of running jobs.

  • Posted on: 28 April 2016
  • By: admin

The power company is performing maintenance on the high-voltage feed to the University on Friday, September 18th, at 06:30.

This will result in a total power loss to the whole University, so we need to drain the batch nodes.

During the days leading up to the maintenance window, it is advisable to submit shorter jobs that can finish in the time remaining before the window starts. To allow a bit of margin, the system will not allow jobs to run after 06:20 on the 18th.
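For example, a job submitted during the drain period can request a short wall-clock limit so the scheduler can still fit it in before the cutoff. A minimal sketch, assuming the Slurm batch system; the job name, script, and `my_program` are hypothetical placeholders:

```shell
#!/bin/bash
# Hypothetical Slurm job script. The key line is --time: keep the requested
# wall-clock limit shorter than the time remaining until the maintenance
# window, otherwise the scheduler cannot start the job before the cutoff.
#SBATCH --job-name=short-job     # illustrative name
#SBATCH --time=01:00:00          # 1-hour limit; must end before 06:20 on the 18th
#SBATCH --ntasks=1

srun ./my_program                # my_program is a placeholder for your own binary
```

Submitted with `sbatch short-job.sh`, a job like this can be backfilled into the remaining time, whereas a job with a multi-day limit would simply stay queued until after the maintenance.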

Batch queues stopped due to /pfs/nobackup being out of inodes (files). (Partially fixed)

  • Posted on: 28 April 2016
  • By: admin

We have unfortunately been forced to stop all batch queues on the clusters.

/pfs/nobackup has run out of inodes. Something probably created more inodes (files) than intended.

We are working on finding out where and getting the usage down to normal levels.

Until this is fixed, we need to keep the batch queues stopped to avoid jobs failing because they cannot create new files.
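One way to locate a runaway file count like this is to count filesystem entries (inodes used) under each top-level directory and sort the result. A minimal sketch; the path /pfs/nobackup comes from the post above, and the script works against any directory:

```shell
#!/bin/sh
# List the top-level directories that hold the most entries (files,
# subdirectories, links -- i.e. inodes used), largest first.
# Usage: ./inode-count.sh /pfs/nobackup   (defaults to the current directory)
root=${1:-.}
cd "$root" || exit 1
for d in */; do
    # find prints every entry at or below $d; wc -l counts those lines
    printf '%s %s\n' "$(find "$d" | wc -l)" "$d"
done | sort -rn | head -n 10
```

Overall inode usage of a mounted filesystem can be checked with `df -i`, which reports inodes used and free per mount point.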

pfs problems (solved)

  • Posted on: 28 April 2016
  • By: admin

The /pfs/nobackup (Lustre) file system is currently unavailable due to after-effects of an electrical maintenance problem (see the "Electrical maintenance causing problems" post below).

We apologize for any inconvenience this may cause.

We are currently working on restoring access to pfs, but we do not have an ETA right now.

This news will be updated with more information when we have it.

*UPDATE 20150820 15:45*
It will, most likely, take at least until Friday the 21st before we can get this resolved.

/pfs/nobackup filesystem now back in production

  • Posted on: 28 April 2016
  • By: admin

The filesystem is now back in production.

As far as we can tell, no files or directories were lost, but if you do find evidence of that, please notify support@hpc2n.umu.se so we can report it to the vendor.

We're sorry about the long downtime, but we were verifying each step with the vendor to do our best not to lose any data.

Mon, 2015-08-24 11:32 | Åke Sandgren

Electrical maintenance causing problems

  • Posted on: 28 April 2016
  • By: admin

The electrical maintenance that was scheduled for 19:00 on Wednesday evening did not go without problems.

Something went slightly wrong, causing one of the UPSes to fail. This in turn caused the cooling system to fail, leading to a rapid rise in temperature and a subsequent emergency power cut.

This caused so many problems that we will not be able to get things back online until tomorrow (Thursday).

We lost, among other things, the /pfs/nobackup filesystem, which is the reason that the queues have been stopped. We expect that jobs have failed.

Updated: 2018-02-14, 15:06