
Resource Management with Slurm

Slurm Job Scheduling System

Article from ADMIN 48/2018

By Jeff Layton

One way to share HPC systems among several users is to use a software tool called a resource manager. Slurm, probably the most common job scheduler in use today, is open source, scalable, and easy to install and customize.

In previous articles, I examined some fundamental tools for HPC systems, including pdsh [1] (parallel shells), Lmod environment modules [2], and shared storage with NFS and SSHFS [3]. One remaining, virtually indispensable tool is a job scheduler.

One of the most critical pieces of software on a shared cluster is the resource manager, commonly called a job scheduler, which allows users to share the system in a very efficient and cost-effective way. The idea is fairly simple: Users write small scripts, commonly called "jobs," that define what they want to run and the resources required, and then submit these scripts to the resource manager. When the resources are available, the resource manager executes the job script on behalf of the user. Typically, this approach is used for batch jobs (i.e., jobs that are not interactive), but it can also be used for interactive jobs, for which the resource manager gives you a shell prompt on the node that is running your job.
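To make the idea of a job script concrete, the following is a minimal sketch of what one might look like for Slurm, the resource manager covered in this article. The job name, resource values, time limit, and output file name are illustrative assumptions, not settings from any particular cluster. The #SBATCH lines are ordinary shell comments that Slurm reads as resource requests, so the script also runs unchanged as a plain Bash script outside the scheduler.

```shell
#!/bin/bash
# Resource requests read by Slurm at submission time (all values are assumptions)
#SBATCH --job-name=hello          # hypothetical job name
#SBATCH --nodes=1                 # request a single node
#SBATCH --ntasks=1                # run a single task
#SBATCH --time=00:05:00           # wall-clock limit of five minutes
#SBATCH --output=hello_%j.out     # %j is replaced by the job ID

# Slurm sets SLURM_JOB_ID inside a job; fall back to a label when run by hand
job_id="${SLURM_JOB_ID:-local-test}"

# The actual work of the job: report where it is running
msg="Job ${job_id} running on $(uname -n)"
echo "${msg}"
```

Assuming the script is saved as hello.sh (a hypothetical file name), it would typically be submitted with sbatch hello.sh, and the scheduler would run it once the requested resources are free. For the interactive case, a shell on a compute node can usually be requested with something like srun --pty bash, although the exact options depend on how a site has configured Slurm.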

Some resource managers are commercially supported and some are open source, either with or without a support option. The list of candidates is fairly long, but the one I talk about in this article is Slurm [4].

Slurm

Slurm has been around for a while. I remember using it at Linux Networx in the early 2000s. Over the years, it has been developed by Lawrence Livermore National Laboratory, SchedMD [5], Linux Networx, Hewlett-Packard, and Groupe Bull [6]. According to the website, Slurm provides three functions [7]: