Managing Jobs
Information on jobs
List all current jobs for a user:
squeue -u <username>
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List all current jobs in the shared partition for a user:
squeue -u <username> -p shared
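If the default squeue columns truncate job names, the output layout can be customized; a minimal sketch using standard squeue format specifiers (the column widths shown are just one possible choice):
squeue -u <username> -o "%.18i %.9P %.30j %.8T %.10M %.6D %R"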
List detailed information for a job (useful for troubleshooting):
scontrol show jobid -dd <jobid>
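Because the detailed output is long, it is often filtered when troubleshooting; a minimal sketch that pulls out the line containing the job state and pending reason:
scontrol show jobid -dd <jobid> | grep -i jobstate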
List status info for a currently running job:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
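For jobs submitted with sbatch, the batch step itself can be queried by appending .batch to the job ID; a minimal sketch:
sstat --format=JobID,MaxRSS,MaxVMSize -j <jobid>.batch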
Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc. To get statistics on completed jobs by jobID:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
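On busy accounts, sacct output is easier to read when restricted to a time window; a minimal sketch using the standard --starttime and --endtime options (the dates are placeholders):
sacct -u <username> --starttime=2024-04-01 --endtime=2024-04-26 --format=JobID,JobName,MaxRSS,Elapsed,State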
Job Priority
A job's priority at any given time will be a weighted sum of all the factors that have been enabled in SLURM.
All factors below in the Job_priority formula are floating point numbers that range from 0.0 to 1.0.
The larger the number, the higher the job will be positioned in the queue, and the sooner the job will be scheduled.
Job_priority =
    site_factor +
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightAssoc) * (assoc_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (partition_factor) +
    (PriorityWeightQOS) * (QOS_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
    - nice_factor
The job priority weights for the WI-HPC cluster are set to the following (as of 04/26/2024):
PriorityType=priority/multifactor
site_factor=0
PriorityWeightAge=1000
PriorityWeightAssoc=0
PriorityWeightFairshare=8000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0
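With these weights, the association and QOS terms drop out, so fair-share dominates. As a worked example with hypothetical factor values (age_factor=0.5, fair-share_factor=0.25, job_size_factor=0.1, partition_factor=1.0, and no TRES or nice contributions):
Job_priority = 0 + 1000*0.5 + 0 + 8000*0.25 + 1000*0.1 + 1000*1.0 + 0
             = 500 + 2000 + 100 + 1000
             = 3600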
List the priority of a job:
sprio -j <jobid>
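To see how each weighted factor contributes to a job's priority, or to print the configured weights themselves (both are standard sprio options):
sprio -l -j <jobid>
sprio -w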
List priority order of jobs for the current user (you) in a given partition:
showq-slurm -o -u -q <partition>
Controlling jobs
To cancel one job:
scancel <jobid>
To cancel all the jobs for a user:
scancel -u <username>
To cancel all the pending jobs for a user:
scancel -t PENDING -u <username>
To cancel one or more jobs by name:
scancel --name myJobName
To hold a particular job from being scheduled:
scontrol hold <jobid>
To release a particular job to be scheduled:
scontrol release <jobid>
To requeue (cancel and rerun) a particular job:
scontrol requeue <jobid>
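These controls can be combined with squeue for bulk operations; a minimal sketch that holds every pending job for the current user (squeue's -h suppresses the header and -o %i prints only job IDs):
squeue -u <username> -h -t PENDING -o %i | xargs -r -n1 scontrol hold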
Monitoring Resource Usage
Monitoring the resources that jobs are using can be done with sstat. This monitors the resources used by all steps in a job, and a number of different statistics are available. By default, sstat shows only the job steps and not the stats for the batch job itself; to see all memory use, add the -a option. Please see Slurm's sstat documentation for additional options.
As an example, to check a job's memory consumption:
sstat -a -j 5641126 -o jobid,averss,maxrss,avevmsize,maxvmsize
JobID AveRSS MaxRSS AveVMSize MaxVMSize
------------ ---------- ---------- ---------- ----------
5641126.ext+ 0 0 0 0
5641126.0 28K 32K 28K 32K
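For a live view while a long job runs, the same sstat call can be wrapped in a standard watch loop; a minimal sketch sampling every 60 seconds:
watch -n 60 sstat -a -j <jobid> -o jobid,averss,maxrss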