This FAQ answers common questions from HPC2N users.
Q: I have forgotten my user password. What should I do?
Q: Where can I see how much CPU time my project used?
A: You can use the command projinfo -p <project_ID> -v. For more information see our projectinfo webpage.
Q: What is the CPU Architecture of the cluster?
Abisko: look at the Abisko CPU Architecture page for more information.
Kebnekaise: read the Kebnekaise hardware page.
Q: Why can't I login with SSH Key-Based Authentication?
A: This method of authentication is explicitly disabled on HPC2N's systems. The main reason is that it doesn't work with AFS, the file system we are using. If you want passwordless authentication, you can access HPC2N's systems through GSSAPI-aware SSH clients. GSSAPI lets you type your password once, when obtaining your Kerberos ticket; while that ticket is valid you don't have to retype your password. There is a little more information about this on our login/password page.
Q: Can I access the compute nodes with ssh?
A: No, we do not allow this, mainly since nodes can be shared by different users' jobs.
Q: What is the maximum time a job can run?
A: A job can run for up to the number of allocated core hours per month divided by five. However, the maximum number of (walltime) hours any job can run is 168 (7 days). For more information see our batch system webpage.
Q: Can I log in to computation nodes to see how my jobs are running?
A: We don't allow users to log in to computation nodes. One way to check the status of a job on the nodes it is running on is to use the job activity graphs on our "Graphs of cluster nodes during jobs" webpage.
Q: What combination of nodes and cores should I use for a multi-threaded application?
A: At HPC2N we only allow processes of one user to run on a particular node. This prevents a situation in which a user's multi-threaded application (which runs as one process, and is treated as such by the batch system, but actually uses multiple processors) competes with other users' ordinary processes. If you want to run m multi-threaded processes on n processors, you need to make sure that each process is allocated to exactly one node.
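One way to express this in a submit file, as a sketch: request m nodes, one task on each node, and n/m cores per task. The helper name threaded_directives is hypothetical, and splitting the n processors evenly across the m processes is an assumption about the intended layout; -N, --ntasks-per-node and -c are standard sbatch options.

```shell
# Print #SBATCH directives for m multi-threaded processes on n cores in total,
# placing exactly one process on each node (assumes n is a multiple of m).
threaded_directives() {
    local m="$1" n="$2"
    if [ "$m" -lt 1 ] || [ $(( n % m )) -ne 0 ]; then
        echo "n must be a positive multiple of m" >&2
        return 1
    fi
    echo "#SBATCH -N $m"                  # one node per process
    echo "#SBATCH --ntasks-per-node=1"    # exactly one process on each node
    echo "#SBATCH -c $(( n / m ))"        # cores available to the threads of each process
}

# Example: 2 processes sharing 16 cores, i.e. 8 threads per process.
threaded_directives 2 16
```

The printed lines can be pasted at the top of a submit file. Whether n/m cores fit on one node depends on the cluster.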
For more complex configurations please contact HPC2N support: email@example.com.
Q: When submitting a job without specifying a project account, I get an error:
sbatch: error: You must supply an account (-A ...)
sbatch: error: Batch job submission failed: Unspecified error
A: There is no default project. You must specify a valid project in your submit file (using the #SBATCH -A directive).
Q: I tried to submit a batch job and got an email with a subject similar to:
SLURM Job_id=1722566 Name=test.sbatch Failed, Run time 00:00:01
and no output/error files.
A: You have most likely specified the output file location to be on the AFS file system. However, the batch system does not have an AFS access token to be able to write there. Instead, you will have to use your personal directory on our parallel file system 'pfs' in /pfs/nobackup/u/username. Please see the File systems page for a description of various file systems at HPC2N and how to use them.
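As an illustration, the sketch below writes a minimal submit file that names a project with -A and keeps its output and error files on the parallel file system. The helper name make_submit_file, the project ID, the program ./my_program and the time and task values are all hypothetical placeholders; the directory is assumed to be /pfs/nobackup followed by your home directory path.

```shell
# Write a minimal submit file whose output and error files live on pfs.
# Arguments: <project_id> <file to write>. All values below are placeholders.
make_submit_file() {
    local project="$1" target="$2"
    # Assumption: the personal pfs directory is /pfs/nobackup + $HOME.
    local pfs_dir="/pfs/nobackup${HOME}"
    cat > "$target" <<EOF
#!/bin/bash
#SBATCH -A ${project}
#SBATCH -o ${pfs_dir}/job-%j.out
#SBATCH -e ${pfs_dir}/job-%j.err
#SBATCH -t 00:10:00
#SBATCH -n 1

srun ./my_program
EOF
}

# Example with a placeholder project ID; submit the result with sbatch.
make_submit_file my-project-id test.sbatch
```

In the file names, %j is replaced by the job ID, so each job gets its own output and error files.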
Q: I got "Unable to allocate resources: Job violates accounting/QOS policy" when I submit a job.
A: This is most likely because the project you are trying to use has expired.
You can check the status of your project with: projinfo -p <project_id>
If you already have a new project, update your submit file to use it; otherwise, apply for a new one.
Q: My job is pending and I got "Reason=AssociationResourceLimit" or "Reason=AssocMaxCpuMinutesPerJobLimit"
A: This is because your currently running jobs use up the entire footprint allowance of your project. The job will start when enough of your running jobs have finished that you are below the limit.
Another possibility is that your job is requesting more resources (more core hours) than your allocation permits. Remember: <cores requested> x <walltime> = <core hours you are requesting>. NOTE: If you ask for fewer than six cores on Abisko, you are still accounted for six cores. On Kebnekaise, if you ask for more than 28 cores, you are accounted for a whole number of nodes, rounded up (e.g. 29 cores -> 2 nodes).
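The accounting rules above can be sketched as a small shell function. It is only an estimate built from the rules quoted in this answer: the function name charged_core_hours is hypothetical, and 28 cores per Kebnekaise node is inferred from the "29 cores -> 2 nodes" example.

```shell
# Estimate the core hours a job is accounted for.
# Usage: charged_core_hours <abisko|kebnekaise> <cores> <walltime_hours>
charged_core_hours() {
    local cluster="$1" cores="$2" hours="$3"
    local per_node=28   # assumption: cores per Kebnekaise node
    case "$cluster" in
        abisko)
            # Fewer than six cores are accounted as six.
            if [ "$cores" -lt 6 ]; then cores=6; fi
            ;;
        kebnekaise)
            # More than 28 cores are accounted as whole nodes, rounded up.
            if [ "$cores" -gt "$per_node" ]; then
                cores=$(( (cores + per_node - 1) / per_node * per_node ))
            fi
            ;;
        *)
            echo "unknown cluster: $cluster" >&2
            return 1
            ;;
    esac
    echo $(( cores * hours ))
}

charged_core_hours abisko 4 10        # prints 60  (6 cores x 10 h)
charged_core_hours kebnekaise 29 10   # prints 560 (2 nodes = 56 cores x 10 h)
```

The sketch uses integer arithmetic, so walltime must be given in whole hours.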
Q: I am used to using the PBS batch system. What are the main differences between that and SLURM (which is used at HPC2N)?
A: There are a number of differences between SLURM and more common systems like PBS. The most important ones are:
Comparison of some common commands in SLURM and in PBS / Maui.
|Task||SLURM||PBS||Maui|
|Get information about the job||scontrol show job <jobid>||qstat -f <jobid>||checkjob|
|Display the queue information||squeue||qstat||showq|
|Delete a job||scancel <jobid>||qdel||-|
|Submit a job||srun/sbatch/salloc||qsub||-|
|Display how many processors are currently free||-||-||showbf|
|Display the expected start time for a job||squeue --start||-||showstart <jobid>|
|Display information about available queues/partitions||sinfo/sshare||qstat -Qf||-|
Q: How can I control affinity for MPI tasks and OpenMP threads?
A: You can use mpirun's binding options or srun's --cpu_bind option to control the placement of MPI tasks. Alternatively, use hwloc-bind (from the hwloc module) or numactl.
Q: Why do I get "unknown error" or error number 30 from CUDA calls?
A: The CUDA library wants to write cache information into $HOME/.nv/ComputeCache. That directory is not accessible from batch jobs since it is located in the AFS file system. To solve this, run:
rm -rf $HOME/.nv
mkdir /pfs/nobackup$HOME/.nv
ln -s /pfs/nobackup$HOME/.nv $HOME/.nv
Q: I cannot access files in my home directory (file attributes are set to '?').
A: Run the afslog command to obtain a new AFS token. If that does not help, it is likely that your Kerberos authentication ticket has expired (run klist to check the status). To obtain a new Kerberos ticket, issue the kinit command.
Q: Which files need to be set world-readable, and why?
A: The files .bashrc, .tcshrc, and .forward all need to be world-readable (and thus located in Public), or they will be unreadable for unauthenticated system processes. The batch system needs .bashrc and .tcshrc, but it can only access the parallel file system and world-readable files in your home directory (it does not have an AFS token, and so is unauthenticated). Another unauthenticated program is sendmail, which will not be able to read your .forward file if it is not world-readable.
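A minimal sketch of one way to arrange this is to move each file into Public and leave a symbolic link behind in the home directory. The helper name publish_dotfile is hypothetical, and the sketch assumes that Public is the world-readable directory mentioned above and that a symbolic link from the home directory is acceptable on your system.

```shell
# Move a dotfile into Public/ and replace it with a symbolic link.
# Usage: publish_dotfile <name> [home_dir]   (home_dir defaults to $HOME)
publish_dotfile() {
    local name="$1"
    local home="${2:-$HOME}"
    # Skip files that do not exist or have already been replaced by a link.
    [ -f "$home/$name" ] && [ ! -L "$home/$name" ] || return 0
    mkdir -p "$home/Public"
    mv "$home/$name" "$home/Public/$name"
    # Relative link, so it stays valid if the home directory path changes.
    ln -s "Public/$name" "$home/$name"
}
```

For example, publish_dotfile .bashrc moves $HOME/.bashrc to $HOME/Public/.bashrc; repeat the call for .tcshrc and .forward as needed. The optional second argument lets you try the function on a scratch directory first.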
Q: I accidentally deleted a file. How do I restore it?
A: AFS (and thus your home directory and subdirectories of it) is backed up nightly. The newest backed up version can be found in the directory OldFiles/, found in your home directory. You can just copy the file you deleted from OldFiles/. If it has been more than 24 hours since you deleted the file, you need to contact firstname.lastname@example.org.
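Copying a file back can be sketched as below. The helper name restore_from_oldfiles is hypothetical, and the sketch assumes that OldFiles/ mirrors the layout of your home directory, so a file at project/notes.txt is backed up at OldFiles/project/notes.txt.

```shell
# Copy the newest backed-up version of a file from OldFiles/ back into place.
# Usage: restore_from_oldfiles <path relative to home> [home_dir]
restore_from_oldfiles() {
    local relpath="$1"
    local home="${2:-$HOME}"
    local backup="$home/OldFiles/$relpath"
    if [ ! -e "$backup" ]; then
        echo "no backup found at $backup" >&2
        return 1
    fi
    # Recreate the parent directory in case it was deleted as well.
    mkdir -p "$(dirname "$home/$relpath")"
    # -p keeps the timestamps and mode of the backed-up file.
    cp -p "$backup" "$home/$relpath"
}
```

For example, restore_from_oldfiles project/notes.txt restores that single file. Note that it overwrites any file already at the destination.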
Q: I need to use a specific compiler version with MPI. Which modules should I add?
A: Add the compiler toolchain you want, including MPI (foss, intel, etc.). See our "Installed compilers" page for more information.
Read more about modules here.
Q: Why does my compilation fail with: "*** Subscription: Unable to find a server."?
A: This message appears when all of our PathScale compiler licenses are in use. Try again after a while (about 5-10 minutes).
Q: Can I disable usage of Infiniband by OpenMPI?
A: Use the parameter -mca btl '^openib' with mpiexec. Keep in mind that this option is for testing purposes only, since your MPI communication will then interfere with other gigabit Ethernet traffic (especially the /pfs/nobackup file system traffic).
Q: How do I increase the stack size of an OpenMP thread when running a PathScale(TM) Fortran program?
A: Add export PSC_OMP_STACK_SIZE=128m to your submit file to set the per-thread stack size to 128 MB.
Q: How can I get access to restrictively licensed software?
A: We need to get a confirmation from a license holder that you can use the software along with a license number and/or complete license name.
Q: Should I use mpirun or srun?
A: Both should work interchangeably, though mpirun may not always work with standard input (mpirun prog < file) and Intel MPI.
Q: On Abisko, I am seeing hwloc errors similar to "L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!"
A: This error is benign and can be ignored. The cause is a kernel bug that will not be fixed in the current OS version used on Abisko.