Managing Jobs
Information on jobs
List all current jobs for a user:
squeue -u <username>
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List all current jobs in the shared partition for a user:
squeue -u <username> -p shared
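If the default squeue columns truncate job names, the output layout can be customized; a minimal sketch using standard squeue format specifiers (the column widths shown are just one possible choice):
squeue -u <username> -o "%.18i %.9P %.30j %.8T %.10M %.6D %R"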
List detailed information for a job (useful for troubleshooting):
scontrol show jobid -dd <jobid>
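Because the detailed output is long, it is often filtered when troubleshooting; a minimal sketch that pulls out the line containing the job state and pending reason:
scontrol show jobid -dd <jobid> | grep -i jobstate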
List status info for a currently running job:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
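For jobs submitted with sbatch, the batch step itself can be queried by appending .batch to the job ID; a minimal sketch:
sstat --format=JobID,MaxRSS,MaxVMSize -j <jobid>.batch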
Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc. To get statistics on completed jobs by jobID:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
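On busy accounts, sacct output is easier to read when restricted to a time window; a minimal sketch using the standard --starttime and --endtime options (the dates are placeholders):
sacct -u <username> --starttime=2024-04-01 --endtime=2024-04-26 --format=JobID,JobName,MaxRSS,Elapsed,State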
Job Priority
A job's priority at any given time will be a weighted sum of all the factors that have been enabled in SLURM.
All factors below in the Job_priority formula are floating point numbers that range from 0.0 to 1.0.
The larger the number, the higher the job will be positioned in the queue, and the sooner the job will be scheduled.
Job_priority =
    site_factor +
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightAssoc) * (assoc_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (partition_factor) +
    (PriorityWeightQOS) * (QOS_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
    - nice_factor
The job priority weights for the WI-HPC cluster are set to the following (as of 04/26/2024):
PriorityType=priority/multifactor
site_factor=0
PriorityWeightAge=1000
PriorityWeightAssoc=0
PriorityWeightFairshare=8000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0
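With these weights, the association and QOS terms drop out, so fair-share dominates. As a worked example with hypothetical factor values (age_factor=0.5, fair-share_factor=0.25, job_size_factor=0.1, partition_factor=1.0, and no TRES or nice contributions):
Job_priority = 0 + 1000*0.5 + 0 + 8000*0.25 + 1000*0.1 + 1000*1.0 + 0
             = 500 + 2000 + 100 + 1000
             = 3600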
List the priority of a job:
sprio -j <jobid>
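To see how each weighted factor contributes to a job's priority, or to print the configured weights themselves (both are standard sprio options):
sprio -l -j <jobid>
sprio -w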
List priority order of jobs for the current user (you) in a given partition:
showq-slurm -o -u -q <partition>
Controlling jobs
To cancel one job:
scancel <jobid>
To cancel all the jobs for a user:
scancel -u <username>
To cancel all the pending jobs for a user:
scancel -t PENDING -u <username>
To cancel one or more jobs by name:
scancel --name myJobName
To hold a particular job from being scheduled:
scontrol hold <jobid>
To release a particular job to be scheduled:
scontrol release <jobid>
To requeue (cancel and rerun) a particular job:
scontrol requeue <jobid>
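These controls can be combined with squeue for bulk operations; a minimal sketch that holds every pending job for the current user (squeue's -h suppresses the header and -o %i prints only job IDs):
squeue -u <username> -h -t PENDING -o %i | xargs -r -n1 scontrol hold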
Monitoring Resource Usage
Monitoring the resources that jobs are using can be done with sstat. This monitors the resources used by all steps in a job, and a number of different statistics are available. By default, sstat shows only the job steps and not the stats for the batch job itself; to see all memory use, add the -a option. Please see Slurm's sstat documentation for additional options.
As an example, to check a job's memory consumption:
sstat -a -j 5641126 -o jobid,averss,maxrss,avevmsize,maxvmsize
JobID AveRSS MaxRSS AveVMSize MaxVMSize
------------ ---------- ---------- ---------- ----------
5641126.ext+ 0 0 0 0
5641126.0 28K 32K 28K 32K
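For a live view while a long job runs, the same sstat call can be wrapped in a standard watch loop; a minimal sketch sampling every 60 seconds:
watch -n 60 sstat -a -j <jobid> -o jobid,averss,maxrss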