HPC2N - Support: Frequently Asked Questions (FAQ)

FAQ

 



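This FAQ should help HPC2N users answer common questions regarding user accounts and projects, HPC2N systems, the batch system and batch jobs, file systems, compiling and compilers, and parallel software.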

User accounts and projects

Q: I have forgotten my user password. What should I do?

A: When you applied for a user account, you were asked to keep a copy of your original application in a secure place. We can reset your password to the temporary one given in the application form. If you no longer have the form, we can send you a new copy by mail, provided your address has not changed. Otherwise you will have to apply for a new account under a different user account name. Let us know if you need to save data from your old account. Please contact us by e-mail: support@hpc2n.umu.se.

Q: Where can I see how much CPU time my project used?

A: Use the command projinfo -p <project_ID> -v. For more information, see our projectinfo webpage.

HPC2N systems

Q: What is the physical CPU and memory layout of HPC2N nodes?

A: Each cluster node is an SMP (symmetric multiprocessing) system with 8 cores (on Akka). In an SMP system, identical cores have uniform access to the local shared memory (16 GB on Akka). See the image below (L1: Level 1 Cache, L2: Level 2 Cache, C: processor core, RAM: Random Access Memory, H: HyperTransport).

Batch system and batch jobs

Q: What is the maximum time a job can run?

A: A job can run for up to the number of allocated CPU hours per month divided by five. However, the maximum number of CPU hours any process can run is 120 (or 5 days). For more information see our batch system webpage.
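As a rough illustration of the rule above, the following sketch estimates the longest walltime a job could request. It assumes that a job's CPU hours equal cores times walltime and that the 120-hour cap applies to walltime; both are one reading of the answer, not a description of how the scheduler computes its limits. The function name and the allocation of 40000 CPU hours per month are hypothetical.

```bash
#!/bin/bash
# max_walltime_hours ALLOC_PER_MONTH CORES
# Prints an estimated upper bound, in whole hours, for a job's walltime.
# Assumption: CPU hours used by a job = cores * walltime.
max_walltime_hours() {
    local alloc_per_month=$1 cores=$2
    local budget=$(( alloc_per_month / 5 ))   # CPU hours available to one job
    local hours=$(( budget / cores ))         # integer division rounds down
    if [ "$hours" -gt 120 ]; then
        hours=120                             # cap of 120 hours (5 days)
    fi
    echo "$hours"
}

# Hypothetical allocation: 40000 CPU hours per month.
max_walltime_hours 40000 64     # 8000 / 64 = 125, capped: prints 120
max_walltime_hours 40000 128    # 8000 / 128 = 62.5, rounded down: prints 62
```

The batch system webpage remains the authority on the actual limits.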

Q: My job does not run or was placed in the "Blocked" part of the job queue. Why?

A: The reason depends on the situation. Running the command checkjob <JobID> can give you more information. The end of the output normally shows why a job was not allowed to run at that time. Typical reasons include:

  • scheduling problems; for instance, the requested job walltime exceeds the SOFT MAXPS limit (see example below);
  • requested resources were not available at the time (pmem, pvmem, node or core allocations are not available or too high);
  • job is in a hold state (see Job Holds for more information).

For example, if you requested a walltime that exceeds the SOFT MAXPS limit (see The Batch system at HPC2N), checkjob will report a message similar to this:

job cannot run in partition DEFAULT. (job 163022 
violates active SOFT MAXPS limit of 28800000 for 
acct SNICXXX-YY-ZZZ (R: 21600000, U: 8327520))

As can be seen above, the requested time plus the time being used by your other jobs (R: 21600000 + U: 8327520 = 29927520) exceeds the SOFT MAXPS limit of 28800000 CPU seconds. In such a case you either have to decrease the walltime of your current job or wait until one or more of your running jobs finish (after which the current job is moved automatically from the "Blocked" to the "Idle" part of the queue).
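The arithmetic in the checkjob message can be reproduced with a few lines of shell. This is only a sketch for checking the numbers by hand: the function name fits_soft_maxps is hypothetical, and the values are copied from the example message.

```bash
#!/bin/bash
# fits_soft_maxps REQUESTED USED LIMIT   (all in CPU seconds)
# Succeeds (exit status 0) if REQUESTED + USED does not exceed LIMIT.
fits_soft_maxps() {
    local requested=$1 used=$2 limit=$3
    [ $(( requested + used )) -le "$limit" ]
}

# Values from the example checkjob message.
R=21600000
U=8327520
LIMIT=28800000

total=$(( R + U ))          # 29927520
headroom=$(( LIMIT - U ))   # 20472480: the most a new job could request now

if fits_soft_maxps "$R" "$U" "$LIMIT"; then
    echo "job fits: $total <= $LIMIT"
else
    echo "job blocked: $total > $LIMIT (at most $headroom CPU seconds fit now)"
fi
```

With the example values the job is reported as blocked, and the headroom shows how far the request would have to shrink to fit while the other jobs are still running.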

Q: Can I log in to computation nodes to see how my jobs are running?

A: We usually don't allow users to log in to computation nodes. One way to check the job status on other nodes is to use job activity graphs on our webpage Graphs of cluster nodes during jobs.

Q: What combination of nodes and ppn should I use for a multi-threaded application?

A: At HPC2N we only allow processes of one user to run on a particular node. That way we prevent a situation in which a user with a multi-threaded application (which runs as one process, and is treated as such by the batch system, but actually uses multiple processors) competes with other users' ordinary processes. Supposing you want to run m multi-threaded processes on n processors, you need to make sure that each process is allocated to exactly one node:

  • ask the batch system for n processors (nodes=n) and request pmem close to the node maximum (ca 15000mb on Akka). That will "eat up" the node memory, leaving no room for any other task;
  • set the virtual memory pvmem somewhat above the pmem allocation. The maximum available virtual memory per node is ca 40000mb on Akka.
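The resource requests described above might look as follows in a submit file. This is a sketch, not a recommended configuration: the node count of 4 and the pvmem value of 16000mb are illustrative assumptions, chosen only so that pvmem sits above pmem and below the Akka maximum.

```bash
#!/bin/bash
# Sketch of a submit file header for multi-threaded processes, one per node.
# nodes=4 and pvmem=16000mb are illustrative values, not HPC2N defaults.
#PBS -l nodes=4
#PBS -l pmem=15000mb
#PBS -l pvmem=16000mb

# Start your multi-threaded application below this line.
```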

For more complex configurations please contact HPC2N support: support@hpc2n.umu.se.

Q: My jobs take very long to become scheduled for running on a cluster, what could be the problem?

A: If you did not specify a valid project in your submit file (using the #PBS -A directive), your job will be assigned a low priority in the job queue and run under the project account DEFAULT, which is shared among all users who do not have a SNIC project allocation (large, medium or small). This account is only given a small fraction of system resources, and its main purpose is small-scale testing.

To apply for a project, please see the rules described on the SNIC homepage. A small-level request should be sent directly to HPC2N by the Principal Investigator (PI).
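A quick way to catch a missing project before submitting is to search the submit file for the directive. The helper below is a hypothetical sketch, not an HPC2N tool; it only checks that a "#PBS -A" line with some value is present, not that the project is valid, and it expects a space between -A and the project name.

```bash
#!/bin/bash
# has_project FILE
# Succeeds (exit status 0) if FILE contains a "#PBS -A <project>" line.
# Hypothetical helper for illustration only.
has_project() {
    grep -Eq '^#PBS[[:space:]]+-A[[:space:]]+[^[:space:]]+' "$1"
}

# Demonstration on a throwaway submit file with a placeholder project ID.
demo=$(mktemp)
printf '%s\n' '#!/bin/bash' '#PBS -A SNICXXX-YY-ZZZ' '#PBS -l nodes=1' > "$demo"
if has_project "$demo"; then
    echo "project set"
else
    echo "no project: the job would run under DEFAULT"
fi
rm -f "$demo"
```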

Q: I tried to submit a batch job and got an email with an error message similar to:

Unable to copy file 
/var/spool/torque/spool/<PBS job id>.OU to 
/home/u/username/jobfile.out
>>> error from copy 
/bin/cp: cannot create regular file 
`/home/u/username/jobfile.out': 
Permission denied 
>>> end error output

A: You have specified the output file location to be on the AFS file system. However, the batch system does not have an AFS access token to be able to write there. Instead, you will have to use your personal directory on our GPFS parallel file system in /pfs/nobackup/u/username. Please see the page about File systems for a description of various file systems at HPC2N and how to use them.
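As a sketch of the fix, the output location can be set in the submit file so that it points into the parallel file system. The path components "u" and "username" are the placeholders from the error message and must be replaced with your own. The separate error file and its name are assumptions added for illustration.

```bash
#!/bin/bash
# Write job output to the parallel file system instead of AFS.
# Replace "u/username" with your own directory; jobfile.err is an assumed name.
#PBS -o /pfs/nobackup/u/username/jobfile.out
#PBS -e /pfs/nobackup/u/username/jobfile.err
```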
 

File systems

Q: I cannot access files in my home directory (file attributes are set to '?').

A: Run the afslog command to obtain a new AFS token. If that does not help, it is likely that your Kerberos authentication ticket has expired (run klist to check the status). To obtain a new Kerberos ticket, issue the kinit command.

Q: Which files need to be set world-readable, and why?

A: The files .bashrc, .tcshrc, and .forward all need to be world-readable (and thus located in Public), or they will be unreadable for unauthenticated system processes. The files .bashrc and .tcshrc are needed by the batch system, which can only access the parallel file system and world-readable files in your home directory (it does not have an AFS token, and so is unauthenticated). Another unauthenticated program is sendmail, which will not be able to read your .forward file if it is not world-readable.

Q: I accidentally deleted a file. How do I restore it?

A: AFS (and thus your home directory and subdirectories of it) is backed up nightly. The newest backed up version can be found in the directory OldFiles/, found in your home directory. You can just copy the file you deleted from OldFiles/. If it has been more than 24 hours since you deleted the file, you need to contact support.
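Copying a file back could be scripted as in the sketch below. It assumes that OldFiles/ mirrors the directory layout of your home directory; the function name restore_file and the example path are hypothetical. Note that the copy overwrites any file that already exists at the destination.

```bash
#!/bin/bash
# restore_file RELATIVE_PATH [HOME_DIR]
# Copies HOME_DIR/OldFiles/RELATIVE_PATH back to HOME_DIR/RELATIVE_PATH.
# Assumption: OldFiles/ has the same directory layout as the home directory.
restore_file() {
    local rel=$1 home=${2:-$HOME}
    local backup="$home/OldFiles/$rel"
    if [ ! -e "$backup" ]; then
        echo "no backup copy at $backup" >&2
        return 1
    fi
    mkdir -p "$(dirname "$home/$rel")"   # recreate the directory if it was deleted too
    cp -p "$backup" "$home/$rel"         # -p keeps timestamps and permissions
}

# Example call (hypothetical path):
#   restore_file project/results.txt
```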

Compiling and compilers

Q: I need to use a specific compiler version with MPI. Which modules should I add?

A: First add the compiler, then the MPI module. For example: module add intel-compiler/10.1 openmpi/intel. If you do not need a specific compiler version, it is enough to write module add openmpi/intel, which adds the default version of the compiler.

Q: Why does my compilation fail with: "*** Subscription: Unable to find a server."?

A: The above message occurs when all of our PathScale compiler licences are in use. You have to try again after a while (ca 5-10 minutes).

Parallel Software

Q: What is the difference between mpich and mvapich?

A: mvapich is an implementation of mpich that makes efficient use of the Infiniband network.

Q: Can I disable usage of Infiniband by OpenMPI?

A: Use the parameter -mca btl '^openib' with mpiexec. Keep in mind that this option is for testing purposes only, as your communication will then go over gigabit Ethernet and interfere with other traffic there (especially the /pfs/nobackup file system traffic).

Q: How do I increase the stack size of an OpenMP thread when running a PathScale(TM) Fortran program?

A: Add export PSC_OMP_STACK_SIZE=128m to your submit file to set the per-thread stack size to 128 MB.

Q: How can I get access to licensed software (e.g. VASP)?

A: We need confirmation from the licence holder that you may use the software, along with the licence number and/or complete licence name.