Abisko login node crashed/rebooted
The Abisko login node crashed and has been rebooted. Access should be restored.
Fri, 2015-01-30 18:00 | Birgitte Brydsö
Due to an unexpected internal change in the last kernel update, both AFS (serving your normal home directory) and Lustre (serving /pfs/nobackup) stopped working.
We have therefore blocked all queues to make sure no more nodes reboot into the problematic kernel.
The problem does not affect running jobs.
We will remove this kernel and resume operation as soon as possible...
*UPDATE 2015-02-04 11:35 CET*
Both systems are now back in normal production.
The /pfs/nobackup file system is currently suffering from a misconfiguration.
The file system was created with fewer inodes than intended and we are in short supply at the moment.
This causes creation of files or directories to fail when we run out of inodes, and may therefore cause jobs to fail.
At the time of writing (2015-01-12 16:18 CET) we have ~4 million inodes available, so the problem is not immediate, but depending on what jobs are currently in the queue this could change quickly.
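For users who want to keep an eye on the inode situation themselves, the following is a minimal sketch using Python's standard `os.statvfs` call. The helper name `inode_usage` is made up for this illustration, and the fallback to the current directory is only there so the snippet also runs on machines where /pfs/nobackup is not mounted. Whether the numbers reported this way match what the file system servers see is an assumption.

```python
import os


def inode_usage(path):
    """Return (total, free, used_fraction) inode figures for the
    file system that holds `path`. Hypothetical helper name."""
    st = os.statvfs(path)
    total = st.f_files   # total number of inodes on the file system
    free = st.f_favail   # inodes available to unprivileged users
    # Some file systems report 0 total inodes; avoid dividing by zero.
    used_fraction = (total - free) / total if total else 0.0
    return total, free, used_fraction


if __name__ == "__main__":
    # Fall back to the current directory if /pfs/nobackup is not mounted here.
    path = "/pfs/nobackup" if os.path.exists("/pfs/nobackup") else "."
    total, free, used_fraction = inode_usage(path)
    print(f"{path}: {free} of {total} inodes free "
          f"({used_fraction:.1%} used)")
```

On the command line, `df -i` reports the same kind of per-file-system inode counts.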
The batch master server for Abisko needs hardware maintenance and the /pfs/nobackup file system needs a reconfiguration.
We will begin this on Wednesday Jan 21st, starting at 08:00 CET. Unfortunately the /pfs/nobackup maintenance took longer than expected, so we're still working on this on Thursday Jan 22.
A reservation has been put in place on the batch systems, which means that jobs will not be started unless they are short enough to finish before that time.
The login nodes of both clusters will also be disabled.
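The effect of the reservation can be illustrated with a small calculation: a job is only eligible to start if its requested walltime ends before the maintenance window begins. The sketch below is only an illustration of that rule, not the batch system's actual logic. The function name `fits_before_reservation`, the example submission time and the year 2015 are assumptions made for the example.

```python
from datetime import datetime, timedelta


def fits_before_reservation(now, walltime, reservation_start):
    """Return True if a job started at `now` with the requested
    `walltime` would finish no later than `reservation_start`."""
    return now + walltime <= reservation_start


# Maintenance start from the announcement: Wednesday Jan 21, 08:00 CET
# (year assumed to be 2015).
reservation_start = datetime(2015, 1, 21, 8, 0)

# Hypothetical moment at which the scheduler considers the job.
now = datetime(2015, 1, 20, 8, 0)

# A 12 hour job finishes in time; a 48 hour job does not.
print(fits_before_reservation(now, timedelta(hours=12), reservation_start))
print(fits_before_reservation(now, timedelta(hours=48), reservation_start))
```

In practice this means that requesting a shorter walltime can let a job start before the maintenance instead of waiting in the queue until it is over.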
During the Christmas holidays we are running with reduced staffing.
We will try to solve any arising problems as quickly as possible, but there will be delays due to this.
We will be back at full capacity on Jan 7th.
HPC2N staff wish you all a
We had some RAID problems on our batch server, meaning (among other things) that no jobs could be submitted.
It was necessary to reboot the server. Everything seems to be working correctly again.
Fri, 2014-11-14 12:03 | Birgitte Brydsö
We have some disk problems on our batch server. This means that no jobs can be submitted. We are hoping it will not affect running jobs.
Fri, 2014-11-14 11:47 | Roger Oskarsson
Friday afternoon (around 14:30) we had a power outage which affected both our clusters (Abisko and Akka). Our power company is blaming the weather.
Power should be back now and we are in the process of restarting everything.
Fri, 2014-10-24 14:52 | Roger Oskarsson
The new centre storage is now in production.
The /pfs/nobackup file system is now larger and faster, ... finally.
Almost all users have been synchronized to the new file system.
The few remaining users (those affected will get a separate mail) have been blocked from logging in and their jobs put on hold until the transfer is complete for each user.
Jobs are running again and login has been opened (see exception above).
If you notice anything strange please notify support@hpc2n.umu.se.