Batch System - Abisko/SLURM

Once a parallel program has been successfully compiled, it can be run directly on multi-processor/multi-core compute nodes or, in a production environment, through a job scheduler. A job scheduler keeps track of the available resources in the cluster and handles the scheduling and execution of jobs submitted by multiple users. It typically organizes submitted jobs into a three-part priority queue (queued, running, blocked). The job scheduler also enforces resource usage and job scheduling policies.
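The three-part queue described above can be illustrated with a toy model. The following Python sketch is not how SLURM or any real scheduler is implemented; the names Job and ToyScheduler are hypothetical, and the only policy assumed here is that a job requesting more cores than the cluster has is placed in the blocked list.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    cores: int
    priority: int = 0


@dataclass
class ToyScheduler:
    """Toy model of a job scheduler; hypothetical, for illustration only."""
    total_cores: int
    queued: list = field(default_factory=list)
    running: list = field(default_factory=list)
    blocked: list = field(default_factory=list)

    @property
    def free_cores(self):
        # Resources not currently held by running jobs.
        return self.total_cores - sum(job.cores for job in self.running)

    def submit(self, job):
        # Assumed policy: a request larger than the whole cluster is blocked.
        if job.cores > self.total_cores:
            self.blocked.append(job)
        else:
            self.queued.append(job)

    def schedule(self):
        # Visit queued jobs from highest to lowest priority and start
        # every job that fits into the cores that are still free.
        self.queued.sort(key=lambda job: job.priority, reverse=True)
        still_waiting = []
        for job in self.queued:
            if job.cores <= self.free_cores:
                self.running.append(job)
            else:
                still_waiting.append(job)
        self.queued = still_waiting

    def finish(self, name):
        # Release the cores held by a completed job.
        self.running = [job for job in self.running if job.name != name]


if __name__ == "__main__":
    sched = ToyScheduler(total_cores=8)
    for job in (Job("a", 4, priority=1), Job("b", 6, priority=5),
                Job("c", 16), Job("d", 2)):
        sched.submit(job)
    sched.schedule()
    print("running:", [job.name for job in sched.running])
    print("queued: ", [job.name for job in sched.queued])
    print("blocked:", [job.name for job in sched.blocked])
```

In this sketch a small low-priority job may start ahead of a larger high-priority job that does not currently fit. Real schedulers apply more careful rules so that large jobs are not starved.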

The new cluster, Abisko, runs the open-source job scheduler SLURM, which is similar to Torque/Maui. SLURM provides three key functions:

  1. Allocates exclusive or non-exclusive access to resources to users for some period of time.
  2. Provides a framework for starting, executing, and monitoring work on the set of allocated nodes.
  3. Manages a queue of pending jobs in order to distribute work across resources according to policies.
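As a rough illustration of how these functions appear to a user, the Python sketch below builds the text of a minimal batch script. The helper render_batch_script and all values passed to it are hypothetical. The directives --job-name, --nodes, --ntasks and --time are standard sbatch options, but any site-specific options, such as a project or account setting, are deliberately left out.

```python
def render_batch_script(job_name, nodes, ntasks, time_limit, command):
    """Return the text of a minimal SLURM batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        # Resource request: what to allocate and for how long.
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun starts the work on the nodes allocated to the job.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values are placeholders, not recommendations for Abisko.
    print(render_batch_script("demo", 2, 16, "00:10:00", "./my_program"))
```

The #SBATCH lines request the allocation, srun launches the work on the allocated nodes, and submitting the resulting file with sbatch places the job in the queue of pending jobs.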


SLURM is designed to handle thousands of nodes in a single cluster and can sustain a throughput of up to 120,000 jobs per hour.