FAQ

This FAQ should help HPC2N users answer common questions regarding:

User accounts and projects

Q: I have forgotten my user password. What should I do?

A: Go here to reset your password. For this to work you need a SUPR account, and your HPC2N user account has to be connected to it.

Q: Where can I see how much CPU time my project used?

A: You can use the command projinfo -p <project_ID> -v. For more information, see our projectinfo webpage.

HPC2N systems

Q: What is the CPU Architecture of the cluster?

A:

Abisko: look at the Abisko CPU Architecture page for more information.

Kebnekaise: read the Kebnekaise hardware page.

Q: Why can't I login with SSH Key-Based Authentication?

A: This method of authentication is explicitly disabled on HPC2N's systems. The main reason is that it doesn't work with AFS, the file system we are using. If you want passwordless authentication, you can access HPC2N's systems through GSSAPI-aware SSH clients. GSSAPI allows you to type your password once when obtaining your Kerberos ticket; while that ticket is valid you don't have to retype your password. There is a little more information about this on our login/password page.
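As a sketch, a GSSAPI login session might look like the following. The Kerberos realm and host name below are assumptions; see our login/password page for the correct values.

```shell
# Obtain a Kerberos ticket (prompts for your password once).
# The realm name is an assumption; check the login/password page.
kinit username@HPC2N.UMU.SE

# Check that the ticket is valid.
klist

# Log in with a GSSAPI-aware SSH client; no password is needed
# while the ticket is valid (-K also forwards the credentials).
ssh -K username@kebnekaise.hpc2n.umu.se
```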

Q: Can I access the compute nodes with ssh?

A: No, we do not allow this, mainly since nodes can be shared by different users' jobs.

Batch system and batch jobs

Q: What is the maximum time a job can run?

A: A job can run for up to the number of allocated core hours per month divided by five. However, the maximum walltime any job can run is 144 hours (6 days). For more information, see our batch system webpage.
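As a sketch, you request a job's walltime with the --time directive in your submit file. The project ID and program name below are placeholders.

```shell
#!/bin/bash
#SBATCH -A SNIC2017-1-123      # placeholder project ID
#SBATCH -n 28
#SBATCH --time=144:00:00       # requested walltime; must not exceed the maximum

srun ./my_program              # placeholder program
```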

Q: Can I log in to computation nodes to see how my jobs are running?

A: We don't allow users to log in to computation nodes. One way to check job status on the nodes is to use the job activity graphs on our webpage Graphs of cluster nodes during jobs.

Q: What combination of nodes and cores should I use for a multi-threaded application?

A: At HPC2N we only allow processes of one user to run on a particular node. That way we prevent a situation in which a user with a multi-threaded application (which runs as one process, and is treated as such by the batch system, but actually uses multiple processors) competes with other users' ordinary processes. Supposing you want to run n multi-threaded processes, you need to make sure that each process is allocated to exactly one node:

  • ask the batch system for n tasks (cores=n) and add the flag --tasks-per-node=1. Each task then gets a node to itself, "eating up" the node's memory so no other task is placed there; see hardware for the amount of memory available per node on the systems.

For more complex configurations please contact HPC2N support: support@hpc2n.umu.se.
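As a sketch, a submit file for four multi-threaded processes, each on its own node, might look like this. The project ID, thread count, and program name are placeholders (28 threads matches a Kebnekaise compute node).

```shell
#!/bin/bash
#SBATCH -A SNIC2017-1-123      # placeholder project ID
#SBATCH -n 4                   # four tasks ...
#SBATCH --tasks-per-node=1     # ... one per node, so four nodes in total

# One thread per core on the node (28 on Kebnekaise; adjust to the system).
export OMP_NUM_THREADS=28

srun ./my_threaded_program     # placeholder program
```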

Q: My jobs take a very long time to get scheduled to run on the cluster. What could be the problem?

A: The following is only relevant for Abisko (on Kebnekaise you cannot run in the default project; you will get an error). If you did not specify a valid project in your submit file (using the #SBATCH -A directive), your job will be assigned a very low priority in the job queue and will run in the project account DEFAULT, which is shared among all users who don't have a SNIC project allocation (large, medium, or small). This account is only given a small fraction of system resources, and its main purpose is small-scale testing.

To apply for a project, please see the rules described on the SNIC homepage. A small-level request should be sent directly to HPC2N by the Principal Investigator (PI). You can find more information here.

Q: I tried to submit a batch job and got an email with a Subject similar to:

SLURM Job_id=1722566 Name=test.sbatch Failed, Run time 00:00:01

and no output/error files.

A: You have most likely specified the output file location to be on the AFS file system. However, the batch system does not have an AFS access token to be able to write there. Instead, you will have to use your personal directory on our parallel file system 'pfs' in /pfs/nobackup/u/username. Please see the File systems page for a description of various file systems at HPC2N and how to use them.
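As a sketch, point the output and error files at your pfs directory in the submit file. Substitute your own username (and its initial letter); the project ID is a placeholder.

```shell
#!/bin/bash
#SBATCH -A SNIC2017-1-123                          # placeholder project ID
#SBATCH -n 1
#SBATCH --output=/pfs/nobackup/u/username/test.out # on pfs, not AFS
#SBATCH --error=/pfs/nobackup/u/username/test.err

srun ./my_program                                  # placeholder program
```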

Q: I got "Unable to allocate resources: Job violates accounting/QOS policy" when I submit a job.

A: This is most likely because the project you are trying to use has expired.
You can check the status of your project with: projinfo -p <project_id>

If you got a new project, update your submit file to use it; otherwise, you can apply for a new one.

Q: My job is pending and I got "Reason=AssociationResourceLimit".

A: This is because your currently running jobs allocate your entire footprint allowance for the project. The job will start when enough of your running jobs have finished that you are below the limit.

Another possibility is that your job is requesting more resources (more core hours) than your allocation permits. Remember: <cores requested> x <walltime> = <core hours you are requesting>. NOTE: If you are asking for fewer than six cores on Abisko, you are still accounted for six cores. On Kebnekaise, if you are asking for more than 28 cores, you are accounted for a whole number of nodes, rounded up (e.g. 29 cores -> 2 nodes).
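As a worked example of the accounting rule above (the numbers assume Kebnekaise's 28-core nodes):

```shell
# Asking for 29 cores for 10 hours on Kebnekaise: 29 cores is
# rounded up to 2 whole nodes of 28 cores = 56 cores accounted.
cores_accounted=$(( 2 * 28 ))
walltime_hours=10
echo $(( cores_accounted * walltime_hours ))   # core hours charged
```

This prints 560, i.e. 560 core hours are charged against the allocation.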

Q: I am used to using the PBS batch system. What are the main differences between that and SLURM (which is used at HPC2N)?

A: There are a number of differences between SLURM and more common systems like PBS. The most important ones are:

  1. No need to 'cd $PBS_O_WORKDIR'. In SLURM your batch job starts to run in the directory from which you submitted the script. You do not have to change to that directory with 'cd $PBS_O_WORKDIR' like you do in PBS.
  2. No need to manually export the environment. The environment variables defined in your shell at the time you submit your script will be exported to your batch job (in PBS you have to use the flag '-V' to achieve this). This also means any modules you have loaded before submission will be passed along by srun and sbatch.
  3. Location of output files. The output and error files are created in their final location immediately, rather than being moved there on completion like in PBS. This means you can examine the output and error files from your job while it is running and they are being created.
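The points above can be sketched in a minimal SLURM submit file (the project ID and program name are placeholders):

```shell
#!/bin/bash
#SBATCH -A SNIC2017-1-123      # placeholder project ID
#SBATCH -n 1
#SBATCH --time=00:10:00
#SBATCH --output=test.out      # created immediately; readable while running

# No 'cd $PBS_O_WORKDIR' needed: the job starts in the submission
# directory, with the submission-time environment and loaded modules.
srun ./my_program              # placeholder program
```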

Comparison of some common commands in SLURM and in PBS / Maui.
 

Action                                                | SLURM                     | PBS              | Maui
Get information about the job                         | scontrol show job <jobid> | qstat -f <jobid> | checkjob
Display the queue information                         | squeue                    | qstat            | showq
Delete a job                                          | scancel <jobid>           | qdel             |
Submit a job                                          | srun/sbatch/salloc        | qsub             |
Display how many processors are currently free        |                           |                  | showbf
Display the expected start time for a job             | squeue --start            |                  | showstart <jobid>
Display information about available queues/partitions | sinfo/sshare              | qstat -Qf        |

Q: How can I control affinity for MPI tasks and OpenMP threads? 

A: You can use mpirun's binding options or srun's --cpu_bind option to control MPI task placement, or use hwloc-bind (from the hwloc module) or numactl.
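A few hedged sketches of these options (spellings vary between versions; check the man pages on the system, and the program names are placeholders):

```shell
# Bind each MPI task to a core with srun:
srun --cpu_bind=cores ./my_mpi_program

# OpenMPI's mpirun binding option:
mpirun --bind-to core ./my_mpi_program

# Bind a single process with hwloc (from the hwloc module):
hwloc-bind core:0 -- ./my_program

# Bind a process to NUMA node 0 with numactl:
numactl --cpunodebind=0 ./my_program
```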

File systems

Q: I cannot access files in my home directory (file attributes are set to '?').

A: Run the afslog command to obtain a new AFS token. If that does not help, it is likely that your Kerberos ticket has expired (run klist to check its status). To obtain a new Kerberos ticket, issue the kinit command.
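The recovery sequence might look like this:

```shell
klist     # check whether your Kerberos ticket is still valid
kinit     # if it has expired, obtain a new ticket (asks for your password)
afslog    # use the ticket to obtain a new AFS token
```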

Q: Which files need to be set world-readable, and why?

A: The files .bashrc, .tcshrc, and .forward all need to be world-readable (and thus located in Public), or they will be unreadable for unauthenticated system processes. The files .bashrc and .tcshrc are needed by the batch system, which can only access the parallel file system and world-readable files in your home directory (it does not have an AFS token, and so is unauthenticated). Another unauthenticated program is sendmail, which will not be able to read your .forward file if it is not world-readable.
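As a sketch, you can make these files world-readable with chmod:

```shell
# Give all users (and thus unauthenticated processes) read access.
chmod a+r ~/.bashrc ~/.tcshrc ~/.forward
```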

Q: I accidentally deleted a file. How do I restore it?

A: AFS (and thus your home directory and subdirectories of it) is backed up nightly. The newest backed up version can be found in the directory OldFiles/, found in your home directory. You can just copy the file you deleted from OldFiles/. If it has been more than 24 hours since you deleted the file, you need to contact support@hpc2n.umu.se.
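For example, restoring a file from the nightly backup (the path is a placeholder):

```shell
# Copy last night's backup of the deleted file from OldFiles back into place.
cp ~/OldFiles/mydir/myfile ~/mydir/myfile
```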

Compiling and compilers

Q: I need to use a specific compiler version with MPI. Which modules should I add?

A:

Add the wanted compiler toolchain with MPI (foss, intel, etc.; see our "Installed compilers" page for more information).

For example:

ml foss

or

ml intel

Read more about modules here.

Q: Why does my compilation fail with: "*** Subscription: Unable to find a server."?

A: This message appears when all of our PathScale compiler licenses are in use. Try again after a while (about 5-10 minutes).

Parallel Software

Q: Can I disable usage of Infiniband by OpenMPI?

A: Use the parameter -mca btl '^openib' with mpiexec. Keep in mind that this option is for testing purposes only, as your MPI communication will then share the gigabit Ethernet with other traffic (especially the /pfs/nobackup file system traffic) and interfere with it.
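As a sketch (the program name is a placeholder):

```shell
# Exclude the openib (Infiniband) transport; for testing purposes
# only, since MPI traffic then shares the gigabit Ethernet.
mpiexec -mca btl '^openib' ./my_mpi_program
```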

Q: How do I increase the stack size of an OpenMP thread when running a PathScale(TM) Fortran program?

A: Add export PSC_OMP_STACK_SIZE=128m to your submit file to set the per-thread value to 128 MB.

Q: How can I get access to restrictively licensed software?

A: We need to get a confirmation from a license holder that you can use the software along with a license number and/or complete license name.

Q: Should I use mpirun or srun?

A: Both should work interchangeably, though mpirun may not always work with standard input redirection (mpirun prog < file) when using Intel MPI.

Updated: 2017-06-21, 09:00