Cluster maintenance at HPC2N 2021-03-22 - 2021-03-25, *FINISHED*

  • Posted on: 5 March 2021
  • By: ake

Dear users,

During this maintenance, 2021-03-22 - 2021-03-25, we will upgrade the parallel file system, where home directories and project storage are located, and make other upgrades to Kebnekaise itself.

Since this maintenance affects the parallel file system, we have to drain the batch nodes of running jobs. Logins will be disabled and active sessions will be terminated during that period.

If you have any data you need access to during the maintenance period, make sure that you have copied it to your own system before the maintenance starts.

During the week leading up to this maintenance only batch jobs which are short enough to end before the maintenance will be allowed to start, while longer jobs will remain in the queue until after the maintenance.
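As a rough illustration of the rule above, the following sketch checks whether a job with a given requested run time would end before the maintenance starts. It is not how the batch system is implemented; the function names are hypothetical, and the 08:00 start time is an assumption, since this announcement only states the start date.

```python
from datetime import datetime, timedelta

# Assumed start of the maintenance window. The announcement gives only the
# date (2021-03-22); the 08:00 time of day is an assumption.
MAINTENANCE_START = datetime(2021, 3, 22, 8, 0)


def fits_before_maintenance(now, walltime, start=MAINTENANCE_START):
    """Return True if a job started at `now` with the requested `walltime`
    would end no later than the start of the maintenance."""
    return now + walltime <= start


def max_walltime(now, start=MAINTENANCE_START):
    """Return the longest run time a job started at `now` could request
    and still end before the maintenance (never negative)."""
    return max(start - now, timedelta(0))


if __name__ == "__main__":
    now = datetime(2021, 3, 20, 8, 0)
    print(fits_before_maintenance(now, timedelta(hours=24)))
    print(fits_before_maintenance(now, timedelta(hours=72)))
    print(max_walltime(now))
```

A job that requests more time than `max_walltime` returns would, under this rule, stay in the queue until after the maintenance.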

  • Parallel file system for home directories and project storage
    • speedup of the system since more features will be enabled 
    • includes patches for the storage problems previously experienced

For the current status of this maintenance, please see our News page: https://www.hpc2n.umu.se/news-and-events

If you have any questions, or if there are any problems or issues after the maintenance, please send them to support@hpc2n.umu.se.

Best regards
Support @HPC2N

 

* UPDATE 2021-03-22 08:00 *

The upgrade is now in progress and no batch jobs are running. The login nodes are unavailable until we're completely finished.

 

* UPDATE 2021-03-25 08:00 *

We're running a file system check to clean out any remaining issues stemming from the file system problems we had during the autumn and winter.

This is taking a lot longer than expected and we are currently not sure when we will be finished, so the maintenance will be extended until Monday 2021-03-29 24:00.

We still hope to be finished before then but we can't currently make a good estimate.

 

* UPDATE 2021-03-27 16:30 *

We're still trying to weed out the file system problems. Our vendor is working hard to figure out exactly what the problem really is.

 

* UPDATE 2021-03-29 11:00 *

Our vendor has found and fixed the main problem we had with the file system. We are now progressing with the last rounds of file system checks.

Since this is a slow process we do not yet have an ETA on when we will be finished.

 

* UPDATE 2021-03-29 16:50 *

The file system is now working on the server side and we have started bringing the batch and login nodes online. This process will take some time, since we want to make sure all nodes are running the new version of the file system code and have been rebooted before we take them into production again.

After we have verified all the batch nodes we will start up a few batch jobs, wait a while and then enable all jobs to start running. When that has been running for a couple of minutes we will enable logins on the login nodes.

ETA for enabling logins is 22:00.

 

* UPDATE 2021-03-29 19:05 *

The system is now back in production.

Batch jobs are running and the login nodes are open for logins.

Updated: 2021-04-08, 16:19