TensorFlow 1.5.0 now installed on Kebnekaise for both CPU and GPU
TensorFlow 1.5.0 has now been installed on Kebnekaise.
There is both a CPU and a GPU version.
Note that the module has changed name and is now TensorFlow with capital T and F.
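With the renamed module, loading it now uses the capitalized spelling. A minimal sketch, assuming the usual module commands on Kebnekaise (the exact version and any prerequisite toolchain modules may differ; check with `module spider`):

```shell
# Assumed invocation; use "module spider TensorFlow" to see the
# exact versions and prerequisite toolchains available on the system.
module spider TensorFlow        # list available TensorFlow versions
module load TensorFlow/1.5.0    # new name with capital T and F
                                # (the old lowercase spelling no
                                # longer matches the module name)
```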
We are currently experiencing problems with Slurm (our batch system). The cause is a Slurm upgrade that did not work as well as we had hoped.
We are working on the problem at the moment. This news item will be updated when we have more information.
*Update*
17:50 - The batch system should now work again.
We are currently experiencing problems with the /pfs/nobackup file system.
One of the storage units is misbehaving, causing problems when accessing files located on it.
Because of this, the batch queues on both Abisko and Kebnekaise have been stopped.
Login may hang for users whose login script (.bashrc) tries to access the /pfs/nobackup file system.
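One defensive pattern for such cases is to guard file-system access in the login script behind a quick, time-limited check. A sketch, assuming a Linux system with coreutils `timeout` (the `PROJ_STORAGE` variable and path below are illustrative examples, not an official recommendation):

```shell
# Example ~/.bashrc fragment: only touch /pfs/nobackup if a short,
# time-limited check succeeds, so a hung file system cannot block login.
PFS=/pfs/nobackup
if timeout 2 test -d "$PFS" 2>/dev/null; then
    # Safe to reference paths under /pfs/nobackup here.
    export PROJ_STORAGE="$PFS/home/$USER"   # hypothetical variable
fi
```

The `timeout 2` bounds how long the `test -d` probe may take; if the file system is unresponsive, the probe is killed after two seconds and the guarded block is simply skipped.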
We currently have no estimate for when the problem will be solved.
* UPDATE: 2018-01-24 20:00 *
The problem has now been solved.
We currently have some problems with cluster login. This affects all users intermittently.
We are working on a solution and will update this news when we have more information.
UPDATE: The problem has been solved.
During the coming days we will be upgrading the kernel on all systems due to the Meltdown/Spectre security bugs.
This may cause some disruption to the normal behaviour of the systems, such as slow or temporarily unavailable file system access, stopped queues, and the like.
We will try to minimize user-visible effects, but this is an upgrade we must do as quickly as possible, and at times we may have to do things in a way that somewhat degrades the user experience.
On Tuesday 2017-12-19 there will be maintenance on the cooling system for the room housing our cluster storage (/pfs/nobackup).
This means that we have to take that storage down and also the clusters and login nodes.
There is currently a reservation on all nodes of both clusters starting 2017-12-19 07:00 CET.
Any jobs that have a walltime requirement that would extend into that maintenance window will not be allowed to start.
We are having some infrastructure outages and it is not currently possible to log in to the clusters.
We're looking into the problem and will update this news item when more information is available.
We are currently experiencing problems with our support address <support@hpc2n.umu.se>. This seems to be SNIC-wide, and they are working on it. This news item will be updated when more information is available.
Update 2017-11-15 14:20
The problem should now be solved, and it should be possible to send support requests to our support address again.
Today, 2017-10-18 13:40, we switched over the software stack on Kebnekaise to one that has been rebuilt from scratch.
All user-level applications and most libraries and helper programs were rebuilt.
We have tried to make sure that nothing our users depend on is missing, but should any job fail due to missing libraries or similar, please notify support@hpc2n.umu.se immediately and we will fix the problem.
On Sep 4th 08:00 CEST we will have a maintenance window to change some parameters for the lustre file system.
To be able to do this we need to empty the clusters from running jobs and reboot all nodes including the login nodes.
As we get closer to that point in time, jobs will not be allowed to start if their requested runtime is too long to fit before 2017-09-04 08:00.
In other words, submitting jobs with shorter runtimes will be a good idea.
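In practice, that means requesting a walltime short enough to finish before the window opens. A hypothetical batch script sketch (the project account, core count, and program name are placeholders, not real values):

```shell
#!/bin/bash
# Hypothetical Slurm batch script: a job submitted on 2017-09-03
# with a 4-hour walltime can still start before the 08:00 maintenance
# reservation, while a multi-day request would be held until after it.
#SBATCH -A SNIC2017-x-yy       # placeholder project account
#SBATCH -n 28                  # placeholder core count
#SBATCH --time=04:00:00        # short requested runtime

srun ./my_program              # placeholder executable
```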