HPC2N - Support - The Batch system: Policies for the batch system

Batch system Policies

 

The policy currently states that:

  • A user can put as many jobs as they wish into the queue, but can have no more than 8 jobs queued for scheduling (i.e. in the idle part of the queue). The rest of the jobs stay in blocked status until earlier jobs have cleared from the idle part.
  • A user can have no more than 72 jobs in total (running and idle) in the queue.
  • A job will not start unless the sum of PS (processor seconds, the sum of CPU usage for all processes in the job) for the job and all running jobs with the same account/userid is within the allotted limit (referred to as MAXPS in the MAUI configuration).
  • A job is not allowed to run longer than 5 days (432000 s), regardless of the allocated CPU time.

Example: if you submit 10 jobs to the queue, the first 8 will go to 'idle' and the next 2 to 'blocked'. When one of the 'idle' jobs starts running, the newest of the 'blocked' jobs will be upgraded to 'idle' status and can be scheduled for start.
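The split in this example can be sketched in a few lines of Python. This is only an illustration of the rule, not the scheduler's code: the function name `place_jobs` is made up here, and the sketch assumes the user has no other jobs in the queue.

```python
MAX_IDLE_PER_USER = 8  # idle-queue limit stated in the policy above


def place_jobs(job_ids, max_idle=MAX_IDLE_PER_USER):
    """Split jobs (given in submission order) into the idle and blocked parts.

    Hypothetical helper: assumes the user has no other jobs queued.
    """
    idle = list(job_ids[:max_idle])
    blocked = list(job_ids[max_idle:])
    return idle, blocked


idle, blocked = place_jobs(list(range(1, 11)))  # ten submitted jobs
print(len(idle), len(blocked))  # 8 2
```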

The following table explains the MAXPS limits for the different account types:

Project account   SOFT MAXPS (seconds)                              HARD MAXPS (seconds)
None (default)    72000                                             1080000
SNAC              Allocated CPU seconds per month divided by five   SOFT MAXPS multiplied by 3.0

A job whose "footprint" (wallclock limit multiplied by the number of CPUs) is larger than MAXPS will not be eligible for scheduling. You calculate PS like this:

PS = WCLIMIT (in seconds) * #procs

and compare it to SOFT MAXPS (the table above shows which value to use).
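As a small illustration, the footprint can be computed from a wallclock limit written the way `showq` prints it. The helper names below are hypothetical, and the sketch assumes the limit has the form H:MM:SS (no day field).

```python
def wallclock_to_seconds(wclimit):
    """Convert an 'H:MM:SS' string to seconds (assumes no day field)."""
    hours, minutes, seconds = (int(part) for part in wclimit.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def processor_seconds(wclimit, procs):
    """PS = WCLIMIT (in seconds) * #procs."""
    return wallclock_to_seconds(wclimit) * procs


print(processor_seconds("1:00:00", 8))  # 28800
```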

A job may never exceed a HARD limit. A job will normally not be allowed to start if it exceeds a SOFT limit.
Jobs with resource requests between the SOFT and HARD limits are considered for scheduling only if there are no other jobs eligible for scheduling.
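The SOFT/HARD rule can be written as a small decision function. This is a sketch under assumptions: the function name and the three result labels are invented for illustration, the default limits are those for users without a project, and a value exactly equal to a limit is assumed to be allowed.

```python
SOFT_MAXPS = 72000     # default SOFT MAXPS (no project)
HARD_MAXPS = 1080000   # default HARD MAXPS (no project)


def classify(total_ps, other_jobs_eligible=True,
             soft=SOFT_MAXPS, hard=HARD_MAXPS):
    """Return a hypothetical label for a job given the total PS it would cause."""
    if total_ps > hard:
        return "rejected"   # a HARD limit may never be exceeded
    if total_ps <= soft:
        return "eligible"   # within the SOFT limit
    # Between SOFT and HARD: considered only if nothing else is eligible.
    return "blocked" if other_jobs_eligible else "eligible"


print(classify(28800))                             # eligible
print(classify(98696))                             # blocked
print(classify(98696, other_jobs_eligible=False))  # eligible
```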

Example (no project, so default SOFT MAXPS)

p-bc9901 [~]$ showq -u user1
ACTIVE JOBS--------------------
JOBNAME    USERNAME   STATE  PROC  REMAINING   STARTTIME

899744       user1   Running  2    9:42:28  Thu Mar 24 10:50:09

     1 Active Job     3829 of 5080 Processors Active (75.37%)
                       635 of  635 Nodes Active      (100.00%)

IDLE JOBS----------------------
JOBNAME    USERNAME   STATE  PROC  WCLIMIT     QUEUETIME

900991       user1    Idle   8     1:00:00  Thu Mar 24 09:49:11
900992       user1    Idle   8     2:00:00  Thu Mar 24 09:50:58
900993       user1    Idle   8     2:00:00  Thu Mar 24 09:55:30
900995       user1    Idle   8     2:00:00  Thu Mar 24 10:04:42
900997       user1    Idle   8     2:00:00  Thu Mar 24 10:08:28
901013       user1    Idle   8     1:30:00  Thu Mar 24 10:43:05
901023       user1    Idle   8     2:29:00  Thu Mar 24 11:19:08
901025       user1    Idle   8     2:00:00  Thu Mar 24 11:24:38

8 Idle Jobs

BLOCKED JOBS----------------
JOBNAME    USERNAME   STATE  PROC  WCLIMIT     QUEUETIME

901069       user1    Idle   8     2:00:00  Thu Mar 24 12:42:29
901070       user1    Idle   8     1:15:00  Thu Mar 24 12:43:41
901071       user1    Idle   8     1:20:00  Thu Mar 24 13:11:23
901074       user1    Idle   8     2:00:00  Thu Mar 24 13:17:44
901108       user1    Idle   8     2:00:00  Thu Mar 24 15:33:06
901620       user1    Idle   8     1:29:00  Thu Mar 24 10:34:26
901624       user1    Idle   8     2:00:00  Thu Mar 24 10:37:25
901627       user1    Idle   8     1:25:00  Thu Mar 24 10:41:13

Total Jobs: 17   Active Jobs: 1   Idle Jobs: 8   Blocked Jobs: 8
p-bc9901 [~]$ 

The column marked 'STATE' shows the status of the jobs. One job is running, and eight jobs are in the idle part of the queue (marked 'Idle') but not yet running. There are also 8 blocked jobs. PS for the running job(s) is found by converting each job's REMAINING time to seconds, multiplying it by the job's number of processors, and adding up the results. If this sum plus the PS of the job trying to start is below the MAXPS limit, the new job will be allowed to start.

Here the user 'user1' has a job with 9:42:28 = 34948 seconds remaining. Since the job is running on 2 processors, that gives

PS = 34948 s * 2 procs  = 69896 s.

Since this value for PS is just under the MAXPS of 72000 s, we check whether a new job can start. The first idle job has

WCLIMIT = 1:00:00 = 3600 seconds.

Since it asks for 8 processors, that gives

PS = 8 procs * 3600 s = 28800 s.

The sum of the PS of the running job and the job trying to start is

69896 s + 28800 s = 98696 s > MAXPS

so no new jobs are allowed to start until the running job finishes. Akka also has a hard limit of 8 jobs per user and 30 jobs per project for the number of jobs in the idle queue. The rest have to wait until running and queued jobs finish before they become unblocked.
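The MAXPS arithmetic in this example can be reproduced with a short script. It is only a sketch of the check described above: `can_start` and `to_seconds` are hypothetical names, and times are assumed to be in the H:MM:SS form printed by `showq`.

```python
MAXPS = 72000  # default SOFT MAXPS (no project)


def to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def can_start(running, candidate, maxps=MAXPS):
    """running: list of (REMAINING, procs); candidate: (WCLIMIT, procs).

    Returns the combined PS and whether it stays within maxps.
    """
    running_ps = sum(to_seconds(t) * procs for t, procs in running)
    candidate_ps = to_seconds(candidate[0]) * candidate[1]
    total_ps = running_ps + candidate_ps
    return total_ps, total_ps <= maxps


# Running job 899744 and idle job 900991 from the showq output above.
total_ps, allowed = can_start([("9:42:28", 2)], ("1:00:00", 8))
print(total_ps, allowed)  # 98696 False
```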

When and if they start also depends on which resources they are requesting. If a job is asking for, say, 10 nodes and only 8 are currently available, the job will have to wait for resources to free up. Meanwhile, other jobs with lower requirements will be allowed to start. If a job is not allowed to start because it currently breaks the policy rules, it is put in the blocked part of the queue. Later, when the user has fewer jobs running, it may be moved up to the idle part of the queue, where it waits for resources.

Detailed description of scheduling

The MAUI scheduler divides the job queue into three parts.

  1. Active jobs.
    These are the jobs that are currently running.
  2. Idle jobs.
    These are the jobs that are being considered for scheduling.
  3. Blocked jobs.
    These are jobs that have been submitted but, for policy reasons (rules and limits), are not yet being considered for scheduling.

When a job is submitted, the following happens.

  1. The job is put in the correct part of the queue (idle or blocked) according to the policy rules.
  2. The scheduler checks if any jobs from the blocked part can be moved up to the idle part without breaking policy rules.
  3. The scheduler calculates a new priority for the jobs in the idle part.
  4. If there are available processor resources the highest priority job(s) will be started.
  5. If the highest priority job cannot be started for lack of resources, the next job that fits, without changing the predicted start window for any higher priority jobs, will be started (so-called backfilling).
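Steps 4 and 5 can be illustrated with a much simplified backfill sketch. It is not MAUI's algorithm: the function and its data layout are assumptions made for this example, it counts only processors as a resource, it protects the start time of only the first job that does not fit, and it backfills only jobs that would finish before that time.

```python
def schedule(free_procs, running, idle):
    """Return the ids of the jobs that can start now.

    running: list of (remaining_seconds, procs)
    idle:    list of (job_id, procs, wclimit_seconds), highest priority first
    """
    started = []
    queue = list(idle)
    running = list(running)

    # Step 4: start jobs in priority order for as long as they fit.
    while queue and queue[0][1] <= free_procs:
        job_id, procs, wclimit = queue.pop(0)
        free_procs -= procs
        running.append((wclimit, procs))
        started.append(job_id)
    if not queue:
        return started

    # The head job does not fit: predict when enough processors are free.
    needed = queue[0][1]
    available = free_procs
    reserved_start = None
    for remaining, procs in sorted(running):
        available += procs
        if available >= needed:
            reserved_start = remaining
            break

    # Step 5: backfill lower priority jobs that fit now and finish
    # before the head job's predicted start.
    for job_id, procs, wclimit in queue[1:]:
        fits_now = procs <= free_procs
        ends_in_time = reserved_start is None or wclimit <= reserved_start
        if fits_now and ends_in_time:
            free_procs -= procs
            started.append(job_id)
    return started


# 4 free processors; one running job frees 8 more in 3600 s.
jobs = [("A", 8, 7200), ("B", 2, 1800), ("C", 2, 7200)]
print(schedule(4, [(3600, 8)], jobs))  # ['B']
```

Here job A has to wait for the running job to finish, job B is backfilled because it ends before A's predicted start, and job C is held back because it would still be running at that time.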

The following measurements are used in priority calculations.

  • Queued time. This is the time since the job entered the idle part of the queue.
  • Expansion factor. This is (queuetime + wallclocklimit)/wallclocklimit.
  • Resources requested. This is a combination of CPUs, memory and disk.
  • Bypass count. How many times this job has been bypassed by the backfill algorithm.
  • Quality of Service (QOS) priority. This is, among other things, a fixed base priority value depending on what QOS the user/account has.
  • Targeted queuetime. This is also based on the QOS for the job. It is a target for the job's queuetime. As a job approaches its targeted queuetime, its priority will grow exponentially.
  • Fair share target. The targeted percentage of the machine that the account/user has.

Priority is then calculated as a weighted sum of these. The weights are currently biased towards fairshare target, the expansion factor and queuetime.
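A weighted sum of this kind can be sketched as follows. The expansion factor formula is the one given in the list above. The weights and measurement values are invented for illustration and are not the ones configured on the system, and only three of the measurements are included.

```python
def expansion_factor(queuetime, wallclocklimit):
    """(queuetime + wallclocklimit) / wallclocklimit, both in the same unit."""
    return (queuetime + wallclocklimit) / wallclocklimit


def priority(measurements, weights):
    """Weighted sum of the measurements; both dicts use the same keys."""
    return sum(weights[name] * value for name, value in measurements.items())


# Hypothetical weights and values, chosen only to show the arithmetic.
weights = {"queuetime": 1.0, "xfactor": 100.0, "fairshare": 50.0}
measurements = {
    "queuetime": 60.0,                       # minutes spent in the idle part
    "xfactor": expansion_factor(3600, 3600), # waited as long as its limit
    "fairshare": 5.0,                        # made-up fair share component
}
print(priority(measurements, weights))  # 510.0
```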