Once a parallel program has been successfully compiled, it can be run on multi-processor/multi-core computing nodes directly or, in a production environment, by means of a batch system. A batch system keeps track of available system resources and takes care of scheduling the jobs of multiple users running their tasks simultaneously. It typically organizes submitted jobs into a three-part priority queue (running, idle, blocked). The batch system is also used to enforce local system resource usage and job scheduling policies.
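The three-part queue can be illustrated with a small toy model. The sketch below is not how Torque, Maui, or SLURM are implemented. The `Job` class, the `schedule` function, and the `max_cores_per_job` limit are hypothetical names invented for this illustration, and the policy is an assumption: a job that asks for more cores than the limit allows is blocked, a job that fits in the free cores starts running, and everything else waits as idle.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user: str
    cores: int
    priority: int


def schedule(jobs, total_cores, max_cores_per_job):
    """Sort jobs into running, idle, and blocked lists (toy policy only)."""
    running, idle, blocked = [], [], []
    free = total_cores
    # Consider the highest-priority jobs first.
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if job.cores > max_cores_per_job:
            # Violates the (assumed) site policy, so it cannot be scheduled.
            blocked.append(job.name)
        elif job.cores <= free:
            # Enough free cores: start the job and reserve its cores.
            running.append(job.name)
            free -= job.cores
        else:
            # Eligible, but has to wait for resources.
            idle.append(job.name)
    return running, idle, blocked


jobs = [
    Job("a", "alice", cores=8, priority=10),
    Job("b", "bob", cores=16, priority=5),
    Job("c", "carol", cores=64, priority=20),
    Job("d", "alice", cores=4, priority=1),
]
running, idle, blocked = schedule(jobs, total_cores=24, max_cores_per_job=32)
print(running, idle, blocked)  # ['a', 'b'] ['d'] ['c']
```

In this model a lower-priority job may start while a higher-priority one is still idle, as long as it fits in the free cores. Real schedulers apply much richer rules (fair-share, reservations, wall-time limits) before making that decision.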
HPC2N has two clusters which accept local batch jobs: Akka and Abisko. They run different job schedulers, which you can read about on these pages.
The batch system on Akka is composed of two parts: Torque, a system resource manager (allocates and enforces limits on nodes, processors, memory, etc.), and Maui, a job scheduler (handles job scheduling policies). Jobs are scheduled according to a set of policy rules and priorities which give users access as fairly as possible with respect to their allotted resources.
Torque is a variation of PBS.
The new cluster, Abisko, runs SLURM, an open-source job scheduler which, like Torque/Maui, provides three key functions. First, it allocates exclusive or non-exclusive access to resources to users for some period of time. Second, it provides a framework for starting, executing, and monitoring work on a set of allocated nodes. Third, it manages a queue of pending jobs in order to distribute work across resources according to policies.
SLURM is designed to handle thousands of nodes in a single cluster and can sustain a throughput of 120,000 jobs per hour.