HPC2N - Support - The Batch system: Job status

PBS Job Status

qstat

The command qstat -a lists the jobs currently in the batch system (queued, held, and running) together with their job IDs.

Example (run on Akka):

p-bc9901 [~/pfs]$ qstat -a 

p-mn01.hpc2n.umu.se: 
                                                            Req'd Req'd Elap
Job ID          Username Queue  Jobname  SessID NDS TSK Memory Time S  Time
----------------------------------------------------------------------------
353476.p-mn01.hpc2n. user1  batch   H2         --   1  --  --  72:00 Q   -- 
402126.p-mn01.hpc2n. user4  batch   job5       --   2  --  --  120:0 H   -- 
402127.p-mn01.hpc2n. user4  batch   job6       --   2  --  --  120:0 H   -- 
402128.p-mn01.hpc2n. user4  batch   job7       --   2  --  --  120:0 H   -- 
402129.p-mn01.hpc2n. user4  batch   job8       --   2  --  --  120:0 H   -- 
472294.p-mn01.hpc2n. user7  default temp4      --   1  --  --  05:10 Q   -- 
472295.p-mn01.hpc2n. user7  default script4    --   1  --  --  05:10 Q   -- 
472296.p-mn01.hpc2n. user7  default script3    --   1  --  --  05:10 Q   -- 
472315.p-mn01.hpc2n. user7  default script2    --   1  --  --  05:10 Q   -- 
472316.p-mn01.hpc2n. user7  default script1    --   1  --  --  05:10 Q   -- 
472317.p-mn01.hpc2n. user7  default new_job    --   1  --  --  05:10 Q   -- 
472318.p-mn01.hpc2n. user7  default parallel   --   1  --  --  05:10 Q   -- 
493066.p-mn01.hpc2n. user7  batch   tmp.sh   12922  12 --  --  32:00 R 29:47
493073.p-mn01.hpc2n. user4  batch   job_akka  8688  1  --  --  92:00 R 27:31
493074.p-mn01.hpc2n. user4  batch   my_job    8743  1  --  --  92:00 R 27:33
493075.p-mn01.hpc2n. user4  batch   my_serial 8786  1  --  --  92:00 R 27:33
493076.p-mn01.hpc2n. user4  batch   my_job2   8881  1  --  --  92:00 R 27:32
493077.p-mn01.hpc2n. user4  batch   my_job3   8923  1  --  --  92:00 R 27:32
493078.p-mn01.hpc2n. user1  batch   my_job4   8992  1  --  --  92:00 R 27:31
472319.p-mn01.hpc2n. user1  default job_akka2  --   1  --  --  05:10 Q   -- 
472320.p-mn01.hpc2n. user1  default job_akka3  --   1  --  --  05:10 Q   -- 
472321.p-mn01.hpc2n. user1  default job_akka4  --   1  --  --  05:10 Q   -- 
472322.p-mn01.hpc2n. user1  default openmp_job --   1  --  --  05:10 Q   -- 

In the S (state) column, 'Q' = Queued, 'R' = Running, and 'H' = Held.
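With a long listing it can help to summarize the state column. The following is a minimal sketch, assuming the column layout shown in the example above (job lines start with a numeric job ID, and the state is the second-to-last field). The function name count_states is hypothetical and not part of PBS.

```shell
#!/bin/sh
# count_states (hypothetical helper): read `qstat -a` output on stdin and
# print the number of jobs in each state, one "STATE COUNT" pair per line.
count_states() {
    awk '
        # Job lines start with a numeric job ID followed by a dot,
        # e.g. "353476.p-mn01.hpc2n."; header lines are skipped.
        /^[0-9]+\./ {
            state = $(NF - 1)   # the S column is the second-to-last field
            count[state]++
        }
        END {
            for (s in count) print s, count[s]
        }
    ' | LC_ALL=C sort
}

# Usage on the cluster (not run here):
#   qstat -a | count_states
```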

The list can be very long, making it difficult to find your own jobs. In that case, add the -u flag to list only the jobs submitted by a specific user:

p-bc9901 [~/pfs]$ qstat -a -u user1

p-mn01.hpc2n.umu.se: 
                                                            Req'd Req'd Elap
Job ID          Username Queue  Jobname  SessID NDS TSK Memory Time S  Time
----------------------------------------------------------------------------
353476.p-mn01.hpc2n. user1  batch   H2         --   1  --  --  72:00 Q   -- 
493078.p-mn01.hpc2n. user1  batch   my_job4   8992  1  --  --  92:00 R 27:31
472319.p-mn01.hpc2n. user1  default job_akka2  --   1  --  --  05:10 Q   -- 
472320.p-mn01.hpc2n. user1  default job_akka3  --   1  --  --  05:10 Q   -- 
472321.p-mn01.hpc2n. user1  default job_akka4  --   1  --  --  05:10 Q   -- 
472322.p-mn01.hpc2n. user1  default openmp_job --   1  --  --  05:10 Q   --

checkjob

To get more information about a specific job, use the command checkjob <job_id>. The <job_id> is printed when the job is submitted, and it is also shown in the output of the commands above. The output of checkjob, in particular its last few lines, can sometimes help you see why the batch system is not starting your job.

Example:

p-bc9901 [~/pfs]$ checkjob 493716


checking job 493716

State: Idle
Creds:  user:user123  group:folk account:DEFAULT class:batch qos:DEFAULT
WallTime: 00:00:00 of 00:30:00
SubmitTime: Mon Nov  9 16:54:40
  (Time Queued  Total: 00:00:06  Eligible: 00:00:06)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 1900M  SWAP: 2000M


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTOR

PE:  1.00  StartPriority:  -19333231
job cannot run in partition DEFAULT (idle procs do not meet 
requirements : 0 of 1 procs found)
idle procs:  56  feasible procs:   0

Rejection Reasons: [CPU          :  662][ReserveTime  :   10]


p-bc9901 [~/pfs]$ 
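Most of the checkjob output is bookkeeping. The lines that usually say why a job is waiting are the state, the "job cannot run" message, and the rejection reasons. As a rough sketch, assuming the output format shown in the example above, these lines can be picked out with grep. The function name summarize_checkjob is hypothetical.

```shell
#!/bin/sh
# summarize_checkjob (hypothetical helper): read `checkjob` output on stdin
# and keep only the lines that indicate the job state and why it is not running.
summarize_checkjob() {
    grep -E '^(State:|job cannot run|Rejection Reasons:)'
}

# Usage on the cluster (not run here):
#   checkjob 493716 | summarize_checkjob
```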

showq

Another useful command is showq, which shows the job queue from the perspective of Maui (the job scheduler). Its output is often more useful than that of qstat, since it shows at a glance how many jobs are active, idle, or blocked. Use the flag -u <username> to limit the output to the jobs belonging to that user.

p-bc9901 [~/pfs]$ showq -u user123 
ACTIVE JOBS--------------------
JOBNAME          USERNAME      STATE  PROC   REMAINING          STARTTIME


     0 Active Jobs    5200 of 5288 Processors Active (98.34%)
                       661 of  661 Nodes Active      (100.00%)

IDLE JOBS----------------------
JOBNAME          USERNAME      STATE  PROC     WCLIMIT          QUEUETIME

500207           user123      Idle     4    00:04:00  Fri Nov 13 14:26:54
500208           user123      Idle     4    00:04:00  Fri Nov 13 14:26:54
500209           user123      Idle     4    00:04:00  Fri Nov 13 14:26:55
500211           user123      Idle     4    00:04:00  Fri Nov 13 14:27:22
500212           user123      Idle     4    00:04:00  Fri Nov 13 14:27:23
500213           user123      Idle     4    00:04:00  Fri Nov 13 14:27:23

6 Idle Jobs

BLOCKED JOBS----------------
JOBNAME          USERNAME      STATE  PROC     WCLIMIT          QUEUETIME

500203           user123      Idle     1    00:30:00  Fri Nov 13 14:26:26
500204           user123      Idle     1    00:30:00  Fri Nov 13 14:26:27
500205           user123      Idle     1    00:30:00  Fri Nov 13 14:26:38
500206           user123      Idle     1    00:30:00  Fri Nov 13 14:26:38
500210           user123      Idle     1    00:30:00  Fri Nov 13 14:27:16
500214           user123      Idle     4    00:04:00  Fri Nov 13 14:27:49

Total Jobs: 12   Active Jobs: 0   Idle Jobs: 6   Blocked Jobs: 6
p-bc9901 [~/pfs]$ 
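Blocked jobs are the natural candidates for a closer look with checkjob. The sketch below extracts the job IDs from the BLOCKED JOBS section, assuming the section headings and column layout shown in the example above. The function name blocked_ids is hypothetical.

```shell
#!/bin/sh
# blocked_ids (hypothetical helper): read `showq` output on stdin and print
# the ID of every job listed under the "BLOCKED JOBS" heading.
blocked_ids() {
    awk '
        /^BLOCKED JOBS/       { in_blocked = 1; next }  # section starts
        /^(ACTIVE|IDLE) JOBS/ { in_blocked = 0 }        # any other section
        # Job lines in the section begin with a purely numeric job ID.
        in_blocked && $1 ~ /^[0-9]+$/ { print $1 }
    '
}

# Usage on the cluster (not run here):
#   showq -u "$USER" | blocked_ids | while read -r id; do checkjob "$id"; done
```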

showstart

This command gives a (very) rough estimate of when a job will start. Note that the job may start sooner or later than estimated, depending on the priority of other (newer) jobs and on how quickly the currently running jobs finish.

Run it with:

showstart <job_id>
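To get estimates for all of your queued jobs at once, the job IDs can be taken from qstat. This is a sketch only: it assumes the qstat -a layout shown earlier on this page, and that showstart accepts the numeric part of the job ID in the same way checkjob does in the example above. The function name queued_ids is hypothetical.

```shell
#!/bin/sh
# queued_ids (hypothetical helper): read `qstat -a` output on stdin and print
# the numeric ID of each job whose state (second-to-last field) is Q.
queued_ids() {
    awk '
        /^[0-9]+\./ && $(NF - 1) == "Q" {
            split($1, part, ".")   # "353476.p-mn01.hpc2n." -> "353476"
            print part[1]
        }
    '
}

# Usage on the cluster (not run here):
#   qstat -a -u "$USER" | queued_ids | while read -r id; do showstart "$id"; done
```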