The policy currently states that
-
a user can submit as many jobs as they wish, but can have no more than 8 jobs queued for scheduling (that is, in the idle part of the queue). The remaining jobs stay in blocked status until earlier jobs have cleared from the idle part;
-
the user can have no more than 72 jobs total (running and idle) in the queue;
-
a job will not start unless the sum of PS (processor seconds, the sum of CPU usage for all processes in the job) for the job and all running jobs with the same account/userid is within the allotted limit (referred to as MAXPS in the MAUI configuration);
-
a job is not allowed to run longer than 5 days (432000 s) regardless of the allocated CPU time.
Example: if you submit 10 jobs to the queue, the first 8 will go to 'idle' and the next 2 to 'blocked'. When one of the 'idle' jobs starts running, the newest of the 'blocked' jobs will be upgraded to 'idle' status and can be scheduled for start.
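The idle/blocked split in this example can be sketched in a few lines of Python. This is only an illustration of the 8-job rule stated above, not how MAUI is implemented; the function name `partition_queue` and the job names are made up for the example.

```python
MAX_IDLE_PER_USER = 8  # idle-queue limit per user, taken from the policy above


def partition_queue(job_ids, max_idle=MAX_IDLE_PER_USER):
    """Split a user's queued jobs (in submission order) into idle and blocked."""
    idle = list(job_ids[:max_idle])
    blocked = list(job_ids[max_idle:])
    return idle, blocked


# Ten hypothetical jobs submitted by one user.
jobs = [f"job{i}" for i in range(1, 11)]
idle, blocked = partition_queue(jobs)
print(len(idle), len(blocked))  # 8 2
```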
The following table explains the MAXPS limits for the different account types:
| Project account | SOFT MAXPS (seconds) | HARD MAXPS (seconds) |
| None (default) | 72000 | 1080000 |
| SNAC | Allocated CPU seconds per month divided by five | SOFT MAXPS multiplied by 3.0 |
A job whose "footprint" (wallclock limit multiplied by the number of CPUs) is larger than MAXPS won't be eligible for scheduling. You calculate PS like this:
PS = WCLIMIT (in seconds) * #procs
and compare it to SOFT MAXPS (you find which value to use from the table above).
A job may never exceed a HARD limit. A job will normally not be allowed to start if it exceeds a SOFT limit.
Only if there are no other jobs eligible for scheduling are jobs with resource requests between the SOFT and HARD limits considered for scheduling.
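As a rough sketch, the footprint calculation and the SOFT/HARD comparison could be written as follows in Python. The default limits and the SNAC rules are taken from the table above; the function names and the returned labels are assumptions made for this illustration only.

```python
DEFAULT_SOFT_MAXPS = 72_000      # seconds, "None (default)" row of the table
DEFAULT_HARD_MAXPS = 1_080_000   # seconds, "None (default)" row of the table


def footprint(wclimit_seconds, procs):
    """PS = WCLIMIT (in seconds) * #procs."""
    return wclimit_seconds * procs


def snac_limits(allocated_cpu_seconds_per_month):
    """SOFT and HARD MAXPS for a SNAC project, following the table."""
    soft = allocated_cpu_seconds_per_month / 5
    hard = soft * 3.0
    return soft, hard


def classify(ps, soft=DEFAULT_SOFT_MAXPS, hard=DEFAULT_HARD_MAXPS):
    """Return a label (hypothetical naming) for a job's footprint."""
    if ps > hard:
        return "exceeds HARD limit"   # may never run
    if ps > soft:
        return "exceeds SOFT limit"   # considered only if nothing else is eligible
    return "within SOFT limit"        # normally eligible


print(classify(footprint(3600, 8)))    # 8 procs for 1 hour  -> 28800 s
print(classify(footprint(7200, 16)))   # 16 procs for 2 hours -> 115200 s
```

With the default limits, the first job is within the SOFT limit and the second falls between the SOFT and HARD limits.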
Example (no project, so default SOFT MAXPS)
p-bc9901 [~]$ showq -u user1
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
899744 user1 Running 2 9:42:28 Thu Mar 24 10:50:09
1 Active Job 3829 of 5080 Processors Active (75.37%)
635 of 635 Nodes Active (100.00%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
900991 user1 Idle 8 1:00:00 Thu Mar 24 09:49:11
900992 user1 Idle 8 2:00:00 Thu Mar 24 09:50:58
900993 user1 Idle 8 2:00:00 Thu Mar 24 09:55:30
900995 user1 Idle 8 2:00:00 Thu Mar 24 10:04:42
900997 user1 Idle 8 2:00:00 Thu Mar 24 10:08:28
901013 user1 Idle 8 1:30:00 Thu Mar 24 10:43:05
901023 user1 Idle 8 2:29:00 Thu Mar 24 11:19:08
901025 user1 Idle 8 2:00:00 Thu Mar 24 11:24:38
8 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
901069 user1 Idle 8 2:00:00 Thu Mar 24 12:42:29
901070 user1 Idle 8 1:15:00 Thu Mar 24 12:43:41
901071 user1 Idle 8 1:20:00 Thu Mar 24 13:11:23
901074 user1 Idle 8 2:00:00 Thu Mar 24 13:17:44
901108 user1 Idle 8 2:00:00 Thu Mar 24 15:33:06
901620 user1 Idle 8 1:29:00 Thu Mar 24 10:34:26
901624 user1 Idle 8 2:00:00 Thu Mar 24 10:37:25
901627 user1 Idle 8 1:25:00 Thu Mar 24 10:41:13
Total Jobs: 17 Active Jobs: 1 Idle Jobs: 8 Blocked Jobs: 8
p-bc9901 [~]$
The column marked 'STATE' shows the status of the jobs. One job is running, and eight jobs marked 'Idle' are in the queue but not yet running. There are also 8 blocked jobs. The PS currently in use is found by converting the REMAINING time of each running job to seconds, multiplying it by that job's number of processors, and adding the results. If this sum plus the PS of the job that is trying to start is below the MAXPS limit, the new job will be allowed to start.
Here the user 'user1' has a job with 9:42:28 = 34948 seconds remaining. Since it is running on 2 processors, that gives
PS = 34948 s * 2 procs = 69896 s.
Since this value for PS is just under the MAXPS of 72000 s, we check whether a new job can start. That job has
WCLIMIT = 1:00:00 = 3600 seconds.
Since it asks for 8 processors, that gives
PS = 8 procs * 3600 s = 28800 s.
The sum of the PS of the running job and the job trying to start is
69896 s + 28800 s = 98696 s > MAXPS
so no new jobs are allowed to start until the running job finishes. Akka also has a hard limit of 8 jobs per user and 30 jobs per project for the number of jobs in the idle queue. The rest will have to wait until running and queued jobs finish before they become unblocked.
When and if they start also depends on which resources they are requesting. If a job is asking for, say, 10 nodes and only 8 are currently available, the job will have to wait for resources to free up. Meanwhile, other jobs with lower requirements will be allowed to start. If a job is not allowed to start because it currently breaks the policy rules, it is put in the blocked part of the queue. Later, when the user has fewer jobs running it may be moved up to the idle part of the queue, and wait for resources.
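The arithmetic of the worked example above can be reproduced with a short Python sketch. It assumes that times are given as H:MM:SS, as in the showq output shown (longer jobs may be printed in a different format, which this sketch does not handle), and it uses the default SOFT MAXPS of 72000 s. The helper name `hms_to_seconds` is made up for the example.

```python
SOFT_MAXPS = 72_000  # default SOFT MAXPS in seconds (no project)


def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (REMAINING, PROC) for each running job, copied from the showq output.
running = [("9:42:28", 2)]
# (WCLIMIT, PROC) for the first idle job, 900991.
candidate = ("1:00:00", 8)

running_ps = sum(hms_to_seconds(t) * procs for t, procs in running)
candidate_ps = hms_to_seconds(candidate[0]) * candidate[1]
total_ps = running_ps + candidate_ps
may_start = total_ps <= SOFT_MAXPS

print(running_ps, candidate_ps, total_ps, may_start)  # 69896 28800 98696 False
```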
Detailed description of scheduling
The MAUI scheduler divides the job queue into three parts.
-
Active jobs.
These are the jobs that are currently running.
-
Idle jobs.
These are the jobs that are being considered for scheduling.
-
Blocked jobs.
These are jobs that have been submitted by a user but, for policy reasons (rules and limits), are not yet being considered for scheduling.
When a job is submitted, the following happens.
-
The job is put in the correct part of the queue (idle or blocked) according to the policy rules.
-
The scheduler checks if any jobs from the blocked part can be moved up to the idle part without breaking policy rules.
-
The scheduler calculates a new priority for the jobs in the idle part.
-
If there are available processor resources the highest priority job(s) will be started.
-
If the highest priority job cannot be started for lack of resources, the next job that fits without changing the predicted start window of any higher priority job will be started (so-called backfilling).
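The last two steps can be illustrated with a heavily simplified Python sketch. It is not MAUI's actual algorithm: it protects only the start time of the first job that cannot start, and it uses a conservative rule in which a lower priority job is backfilled only if it fits in the free processors now and its wallclock limit ends before that reserved start time. All names and numbers are hypothetical.

```python
def earliest_start(needed, free, running):
    """Seconds until `needed` processors are free.

    `running` is a list of (remaining_seconds, procs) for running jobs.
    """
    available = free
    if available >= needed:
        return 0
    for remaining, procs in sorted(running):
        available += procs
        if available >= needed:
            return remaining
    return float("inf")  # never fits on this machine


def schedule(idle, free, running):
    """Return the names of idle jobs that would be started right now."""
    running = list(running)
    started = []
    reserved_at = None  # reserved start time of the first job that must wait
    for job in sorted(idle, key=lambda j: j["priority"], reverse=True):
        fits_now = job["procs"] <= free
        if reserved_at is None and not fits_now:
            reserved_at = earliest_start(job["procs"], free, running)
            continue
        # Before any reservation: start whatever fits.
        # After a reservation: backfill only jobs that end before it.
        if fits_now and (reserved_at is None or job["wclimit"] <= reserved_at):
            started.append(job["name"])
            free -= job["procs"]
            running.append((job["wclimit"], job["procs"]))
    return started


idle_jobs = [
    {"name": "A", "priority": 100, "procs": 16, "wclimit": 7200},
    {"name": "B", "priority": 50, "procs": 8, "wclimit": 1800},
    {"name": "C", "priority": 10, "procs": 4, "wclimit": 7200},
]
# 12 processors free; one running job frees 4 more in 3600 s.
print(schedule(idle_jobs, free=12, running=[(3600, 4)]))  # ['B']
```

In this example job A has to wait 3600 s for 16 processors. Job B is backfilled because it finishes before then, while job C is held back because it would still occupy processors that A needs at its reserved start time.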
The following measurements are used in priority calculations.
-
Queued time. This is the time since the job entered the idle part of the queue.
-
Expansion factor. This is (queuetime + wallclocklimit)/wallclocklimit.
-
Resources requested. This is a combination of CPUs, memory and disk.
-
Bypass count. How many times this job has been bypassed by the backfill algorithm.
-
Quality of Service (QOS) priority. This is, among other things, a fixed base priority value depending on what QOS the user/account has.
-
Targeted queuetime. This is also based on the QOS for the job. It is a target for the job's queuetime. As a job approaches its targeted queuetime, its priority will grow exponentially.
-
Fair share target. The targeted percentage of the machine that the account/user has.
Priority is then calculated as a weighted sum of these. The weights are currently biased towards the fair share target, the expansion factor and the queued time.
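A minimal Python sketch of such a weighted sum is given below. The expansion factor follows the formula given in the list; the factor names, factor values and weights are invented for the illustration and are not the values configured on the system.

```python
def expansion_factor(queue_time_s, wclimit_s):
    """(queuetime + wallclocklimit) / wallclocklimit."""
    return (queue_time_s + wclimit_s) / wclimit_s


def priority(factors, weights):
    """Weighted sum of the priority factors."""
    return sum(weights[name] * value for name, value in factors.items())


# Hypothetical job: queued for 2 hours with a 1 hour wallclock limit.
factors = {
    "queued_time_minutes": 120.0,
    "expansion_factor": expansion_factor(7200, 3600),
    "bypass_count": 2.0,
}
# Hypothetical weights, chosen only to show the calculation.
weights = {
    "queued_time_minutes": 1.0,
    "expansion_factor": 10.0,
    "bypass_count": 5.0,
}

print(expansion_factor(7200, 3600))  # 3.0
print(priority(factors, weights))    # 160.0
```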