HPC systems can benefit from administrator-defined prolog and epilog scripts.

Prolog and Epilog Scripts

Virtually all HPC systems use a resource manager, also called a job scheduler. This tool runs applications on behalf of the user when the free resources match what the user requests. Of the numerous resource managers over the years, the currently popular ones are Slurm, portable batch system (PBS) variants, Grid Engine variants, and IBM Spectrum LSF. Resource managers are absolutely key to allowing users to share systems.

A part of the resource manager is the ability to run administrator-defined scripts before the user's application starts and after it finishes. These scripts afford the administrator a great deal of control and flexibility in configuring or re-configuring the nodes for the next user.

Classically, these scripts are referred to as the prolog scripts that run before the application (from prologue, an introductory piece in a literary or musical work) and the epilog scripts that are run after the application finishes (from epilogue, a section at the end of a book). In this article, the words prolog and epilog will be used because they are in the HPC vernacular and have been for some time.

Resource Manager

The common resource managers used today can execute prolog and epilog scripts with root permissions. Each resource manager is slightly different, but fundamentally they all execute administrator-defined scripts around the user's job; how, when, and which scripts are executed varies by resource manager. You might expect the scripts to run on all the nodes the resource manager has allocated to the user, but if you are not careful, they will only run on the first allocated node.

Some resource managers have several prolog and epilog script options that offer more precise control. For example, Slurm has several “different” versions of prolog and epilog scripts, and an online reference explains when and where each is executed. The reference also has some good suggestions about the scripts in general, such as not making them too long and being careful about which Slurm commands they call; following this advice simplifies the scripts and avoids resource manager conflicts that would then affect all users.
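As a sketch of how these hooks are wired up in Slurm, the slurm.conf fragment below names node-level and task-level scripts; the paths are placeholders, and you should confirm the parameters against the documentation for your Slurm version:

# slurm.conf fragment (script paths are site-specific placeholders)
Prolog=/etc/slurm/prolog.sh            # run by slurmd (typically as root) on compute nodes
Epilog=/etc/slurm/epilog.sh            # run by slurmd (typically as root) after the job ends
TaskProlog=/etc/slurm/task_prolog.sh   # run as the user before each task
TaskEpilog=/etc/slurm/task_epilog.sh   # run as the user after each task
PrologFlags=Alloc                      # run the prolog on every allocated node at allocation time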

Most prolog and epilog scripts are written in a shell script, Bash being the most prevalent. You should sharpen up on your Bash-fu when writing these scripts because mistakes will affect all users. It is best to test them on a test node with its own queue that only you, as a user, can use (don’t just test as root). You can also add a trusted user to the queue to get their feedback before trying the scripts in production. Would you rather have one or two frustrated users or many frustrated users?
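With Slurm, for example, one way to set up such a sandbox is a small partition restricted to an administrative group; the node and group names below are placeholders:

# slurm.conf fragment: a test partition only members of hpcadmin can use
PartitionName=prologtest Nodes=node001 AllowGroups=hpcadmin Default=NO State=UP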

However, the scripts do not have to be written in Bash. You can use whatever tool you like and are comfortable with. For example, they can be written in Python or Perl. However, using these languages can complicate the situation because they will need to be installed on every node the same way. Starting the interpreters for these languages will also take more time than just running Bash scripts.

You could even write the scripts in a compiled language such as C, or have the Bash script use a compiled program. It is recommended that you statically link the compiled binaries so you can put them on the nodes you want or need without having to drag in the compiler, libraries, and tools. As with the interpreted scripts, be sure you test these binaries on an isolated node before putting them into production. I would also recommend testing them on a node without the development tools, libraries, and compilers, to make sure the static compilation pulled in everything needed.
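A minimal sketch of that workflow, assuming a GCC toolchain and a hypothetical helper source file named cleanup_node.c, might look like this:

# Build a statically linked helper on the build host (names are placeholders)
gcc -O2 -static -o cleanup_node cleanup_node.c

# Verify that no shared libraries are needed before copying it to the nodes
ldd ./cleanup_node    # should report "not a dynamic executable"
file ./cleanup_node   # should include "statically linked"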

Regardless of how the scripts are written, be sure you document them. Documentation of administrator-written scripts and tools is seriously lacking: You might not always be the admin for the system, so someone will have to come in and take over. Proper documentation will greatly help them.

The following examples of prolog and epilog scripts from the Internet illustrate what you can do.

Prolog

As a reminder, the prolog scripts run before the user’s job and are run with root permissions. If you want them to run on all nodes allocated to the user, be sure you read the documentation for your resource manager and set the correct parameters.

Prolog scripts have many uses – pretty much anything you can imagine, including configuring the user’s environment for running their application, clearing out any unneeded files or data from previous users, setting up specific storage directories on specific storage systems (e.g., high-speed scratch storage), copying the user’s input to the storage directories, and copying back any output data to the user’s /home directory or group storage directory (although this is likely an epilog script).

Configuring the user’s environment can include loading specific environment modules or configuring environment variables for GPU environments. The prolog script can base these actions on the directives in the user’s job file or the type of node assigned to the user.
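As a hedged sketch of that idea, the TaskProlog fragment below adjusts the task environment based on what Slurm assigned; the paths are placeholders, the exact GPU-related environment variable differs between Slurm versions, and the echo "export ..." idiom only works in a TaskProlog:

#!/bin/bash
# Hypothetical TaskProlog fragment (paths and variable names are assumptions)

# If the job was allocated GPUs, point the CUDA JIT cache at job-local scratch
if [ -n "$SLURM_JOB_GPUS" ]; then
    echo "export CUDA_CACHE_PATH=/scratch/$SLURM_JOB_ID/.nv"
fi

# Prepend a site-specific environment module tree for this class of node
echo "export MODULEPATH=/share/apps/modulefiles:$MODULEPATH"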

A common use of a prolog is setting up storage for the user. Because high-speed storage is more expensive than larger amounts of slower storage, creating user-specific directories on the high-speed storage is a common prolog task; the same holds for local storage space on the assigned nodes. Because space is limited, the prolog might also clear out “old” directories for that user or directories tied to other users.
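A minimal sketch of such a cleanup, assuming the per-user directory layout used later in Listing 3 and an arbitrary 30-day threshold:

#!/bin/bash
# Hypothetical prolog fragment: remove per-user scratch directories that have
# not been modified in 30 days (path and threshold are assumptions)
find /scratch/jobs -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +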

To begin, I'll look at some examples of what you can do in a prolog file. If you are not familiar with prolog and epilog files, the examples start simple and get more involved. I found these examples on the Internet, so I cannot vouch for their correctness. Please test them on an isolated node before putting them into production. Also, please do not assume I wrote the scripts. I deserve neither the praise nor the blame for them. However, if you find errors or a better way to write them, please share with everyone.

Example 1: Start Simply

The easiest way to learn how to write a prolog is to start with something simple that should not cause any problems. A very simple script writes the job ID and the list of nodes for the user's job to stdout (Listing 1).

Listing 1: List Job ID and Nodes

#!/bin/sh
#
# Sample TaskProlog script that will print a batch job's
# job ID and node list to the job's stdout
#
 
if [ X"$SLURM_STEP_ID" = "X" -a X"$SLURM_PROCID" = "X"0 ]
then
  echo "print =========================================="
  echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
  echo "print SLURM_NODELIST = $SLURM_NODELIST"
  echo "print =========================================="
fi

This prolog is specific to Slurm but could be easily adapted to any resource manager.
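For instance, on Torque-style PBS variants, the prologue traditionally receives the job ID and user name as its first two positional arguments rather than as environment variables, so a rough (untested) equivalent might look like the following:

#!/bin/sh
# Rough Torque/PBS-style prologue sketch: $1 is the job ID, $2 the user name
# (per the traditional Torque prologue argument order)
echo "=========================================="
echo "JOB ID = $1"
echo "USER   = $2"
echo "=========================================="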

Example 2: User Directories

This second example is more involved. The code in Listing 2, although fairly simple, uses the find command to clear everything in a node’s /tmp and /scratch if no users are logged in and then creates two directories in /scratch and does a chown to the user.

Listing 2: Clear /tmp and /scratch

#!/bin/sh
 
DIR=/var/spool/bash-login-check/
 
# If no users are logged in we can clear everything in /tmp and /scratch
if [ ! -d $DIR -o `find $DIR -name 'uid_*' | wc -l` -eq 0 ] ; then
    find /scratch -path '/scratch/.com' -prune -or -path '/scratch/.usertmp' -prune -or -delete
    find /scratch/.usertmp -mindepth 1 -delete
fi
 
user=$SLURM_JOB_USER
tmp=/scratch/$SLURM_JOBID
mkdir -m 700 $tmp && chown $user:$user $tmp
tmp=/scratch/fhgfs_$SLURM_JOBID
mkdir -m 700 $tmp && chown $user:$user $tmp
 
/usr/local/sbin/auditd-check
/usr/local/sbin/bash-login-update -a $SLURM_JOB_UID

The next-to-last line in the script touches the Linux Auditing System, which collects information about the node that can then be used for a variety of tasks. The auditd-check tool appears to be a local helper (the path points to /usr/local/sbin) built around auditd, the userspace component that writes the audit records to storage. Whether you use this kind of tool is entirely up to you, but as long as it does not affect the execution and performance of the user's workload, I don't see a problem with it.

Example 3: Setting Temporary I/O Directories

Very close to the previous example is one that creates temporary directories for the user (Listing 3). The first loop creates per-user and per-job directories under /scratch/jobs, /dev/shm/jobs, and /tmp/jobs if they do not exist; does a chown to the user; and sets the permissions. A second script (Listing 4) from the same source sets three environment variables pointing to the three directories that were previously created, for users to use in their scripts, and defines a timeout for idle interactive sessions. Finally, it sets the OpenMP environment variable OMP_NUM_THREADS to match the user's requested resources.

Listing 3: Create Temporary Directory

#!/bin/bash
 
# Create temporary directories for the job
for DIRECTORY in /scratch/jobs/$SLURM_JOB_USER /dev/shm/jobs/$SLURM_JOB_USER /tmp/jobs/$SLURM_JOB_USER
do
    if [ ! -d "$DIRECTORY" ]; then
        mkdir -p $DIRECTORY
        chown $SLURM_JOB_USER $DIRECTORY
        chmod 700 $DIRECTORY
    fi

    TDIRECTORY=$DIRECTORY/$SLURM_JOBID

    if [ ! -d "$TDIRECTORY" ]; then
        mkdir -p $TDIRECTORY
        chown $SLURM_JOB_USER $TDIRECTORY
        chmod 700 $TDIRECTORY
    fi
done

Listing 4: Create Environment Variables

#!/bin/bash
 
# Exposing the environment variables pointing to the temporary folders
echo "export SHM_DIR=/dev/shm/jobs/$USER/$SLURM_JOBID"
echo "export TMP_DIR=/tmp/jobs/$USER/$SLURM_JOBID"
echo "export SCRATCH_DIR=/scratch/jobs/$USER/$SLURM_JOBID"
 
# Define a timeout for idle interactive job sessions
echo "export TMOUT=300"
 
# Exporting a default value of OMP_NUM_THREADS for those jobs 
# requesting multiple CPUs per task.
if [ "${SLURM_CPUS_PER_TASK:-0}" -ge 1 ]; then
    echo "export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK"
fi

Example 4: Nvidia DCGM

The Nvidia Data Center GPU Manager (DCGM) tool collects GPU information. It can be used to gather statistics on specific GPUs on the nodes used by the job – only while the job is running – which allows a more fine-grained analysis of how the GPU is being used. A good blog article, albeit a couple of years old, talks about how to do this. The blog begins by showing how the process works manually; then, it has an example that uses Slurm prolog and epilog scripts.

The Slurm prolog from the article is shown in Listing 5. It begins by defining the DCGM group for the run. In this case, the group is for all of the GPUs in the node (don’t forget, the prolog should be run on all nodes in the job). The DCGM is then enabled to gather stats on the GPUs in the group (the -e option) for a specific job ID (the -s option).

Listing 5: Prolog to Gather GPU Statistics

# DCGM job statistics
group=$(sudo -u $SLURM_JOB_USER dcgmi group -c allgpus --default)
if [ $? -eq 0 ]; then
  groupid=$(echo $group | awk '{print $10}')
  sudo -u $SLURM_JOB_USER dcgmi stats -g $groupid -e
  sudo -u $SLURM_JOB_USER dcgmi stats -g $groupid -s $SLURM_JOBID
fi

A matching epilog is shown in Listing 6 for completeness. It writes the detailed job statistics report to the working directory of the job.

Listing 6: Epilog to Gather GPU Statistics

# DCGM job statistics
OUTPUTDIR=$(scontrol show job $SLURM_JOBID | grep WorkDir | cut -d = -f 2)
sudo -u $SLURM_JOB_USER dcgmi stats -x $SLURM_JOBID
sudo -u $SLURM_JOB_USER dcgmi stats -v -j $SLURM_JOBID | \
    sudo -u $SLURM_JOB_USER tee $OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOBID.out

You could easily modify this script to copy the report to a central location for administrators, as well as the user.
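For instance, a couple of extra lines at the end of the epilog could keep a copy in an admin-readable directory; the path below is a placeholder and must be writable from the compute nodes:

# Hypothetical addition to Listing 6: archive the report centrally
ADMIN_DIR=/shared/admin/dcgm-reports
mkdir -p $ADMIN_DIR
cp $OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOBID.out $ADMIN_DIR/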

Other Prolog Examples

If you use XALT for tracking jobs on the system, you could use a prolog script to set up the environment variables and paths properly. Although you can configure XALT to be used for every job, depending upon how it is configured, users can disable it accidentally or deliberately. With a prolog, you can set the proper XALT paths, including those for Python. Lines that you could include in a TaskProlog are shown below (the two Python paths are combined into one export because each echoed line expands $PYTHONPATH from the original job environment, so separate exports would overwrite each other):

echo "export PATH=/share/apps/xalt/bin:$PATH"
echo "export PYTHONPATH=/share/apps/xalt/site:/share/apps/xalt/libexec:$PYTHONPATH"

Finally, Microway has an extensive set of scripts as part of its Microway Cluster Management Software (MCMS). These scripts include a number of examples of things you can do in a prolog or epilog script.

Epilog

In general, epilog scripts run with root permissions after a user’s job has completed. Typically, the script is run on all nodes used in the job, but be sure to check your resource manager documentation on how to make sure it runs on all nodes.

Epilog scripts are great for cleaning up after the user. For example, if directories were created for the user by a prolog, then an epilog can copy all the user data to a specific location (e.g., the user's /home directory) – if all goes well, without going over quota.

Another great use for epilog scripts is “cleaning” up the node to get it ready for the next user. You can also run a node “health check” to make sure the node is healthy, however you choose to define “healthy,” before the next user runs. If the node does not meet your standards, you can mark it as offline and send an email to the admin alias asking someone to take a look at it.
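A minimal sketch of that idea is shown below; check_node_health.sh and the admin address are placeholders for whatever health check and alias your site uses:

#!/bin/bash
# Hypothetical epilog fragment: drain the node and notify the admins if a
# site-provided health check (placeholder name) reports a problem
NODE=$(hostname -s)
if ! /usr/local/sbin/check_node_health.sh ; then
    scontrol update NodeName=$NODE State=DRAIN Reason="Failed post-job health check"
    echo "Node $NODE drained after job $SLURM_JOBID failed its health check" | \
        mail -s "Node $NODE drained" hpc-admins@example.com
fi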

I have heard about one method for cleaning up the node before the next user that re-inits the node. This technique doesn't reboot the node; rather, it runs the command init 3, which drops the node back into multiuser mode with networking and restarts the services in that runlevel. It's very fast compared with rebooting, and it can clear up some issues. However, I've never tested this, and I don't know how it interacts with the resource manager. I recommend testing this process thoroughly before putting it into production.

As with the prolog scripts, I found these examples online. If you see any way to improve them, please let everyone know. I take neither credit nor blame for them.

Example 1: Clean Up Scratch Directories

The first epilog script example shows what you can do to clean up scratch directories on the nodes used by the application (Listing 7). In the epilog script, bash-login-update appears to be a local script (the path points to /usr/local). The comment just above that line indicates that it deletes the user's /tmp files once the user has no more jobs running, which is not a bad idea if you are worried about filling up /tmp.

Listing 7: Clean Up Scratch Directories

#!/bin/sh
 
# bash-multi-login-update deletes the users /tmp files when they have no more
# jobs running, so we just have to delete the job specific folders.
/usr/local/sbin/bash-login-update -r $SLURM_JOB_UID
 
find /scratch/$SLURM_JOBID -delete
find /scratch/fhgfs_$SLURM_JOBID -delete
 
if [ -x /com/sbin/slurm-sanity-check ]; then
    reason=`/com/sbin/slurm-sanity-check -r -v`
    sane=$?
    if [ $sane -ne 0 ] ; then
        /opt/slurm/bin/scontrol update NodeName=`hostname -s` State=DRAIN Reason="$reason"
        exit 0
    fi
fi

The next two lines, beginning with the find command, erase the job-specific directories that the prolog created in /scratch (compare Listing 2). In this script, the files are not copied to the user's /home directory; the script relies on the user to copy their own data.

The slurm-sanity-check custom script is used to detect system errors, such as missing physical memory, failure to ping a specific node (probably a check that the node still has network connectivity), unmounted filesystems, and so on. Remember that this is an epilog script, but you could use the same check as part of a prolog script to verify the state of the node before the user workload is executed.

If some aspect of the sanity check fails, the script calls scontrol, the Slurm command used to view or modify Slurm configuration and state. In this case, it marks the node as DRAIN, with the reason reported by the preceding slurm-sanity-check command.

Example 2: NHC

A fairly popular action to take in an epilog script is to check the “health” of the node after the user’s application or workflow has finished. This action checks aspects of the node, making sure it is ready for the next user. An example tool for doing this is called Node Health Check (NHC).

NHC can be integrated easily into Slurm, Torque and other PBS variants, and Grid Engine. Read through the NHC documentation to see how simply this can be configured.
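As a sketch (check your NHC install path and the NHC documentation for your resource manager), the Slurm integration typically involves a couple of slurm.conf settings, and you can also call NHC directly from an epilog script:

# slurm.conf fragment: run NHC on the compute nodes every five minutes
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300

# Or call it from an epilog script after each job
/usr/sbin/nhc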

What Belongs in the Prolog and Epilog?

Generally, you can put “whatever you want or need” in the prolog and epilog scripts, which is the second most common HPC answer, just behind “it depends.” You do not have to use a prolog or an epilog script at all if you do not want to. It’s absolutely up to you as the administrator of the system.

My recommendation for smaller systems is to start with simple scripts. Begin with the very first prolog script, which writes the job ID and the list of nodes for the user’s job, as a generally harmless building block. In general, it should not affect the user’s application, which is what you want. It could confuse users if the output does not match what they expect, but that’s about the worst it could do. You might let all the users know that the output from the resource manager will look slightly different before you put the prolog script into production.

If you have GPUs in your system, I would highly recommend the DCGM prolog and epilog scripts. This information can be extremely useful in showing users how their jobs use the GPUs. It can also be useful to the administrator, who can see whether the user is using any of the GPUs – or at least one of them if there is more than one per node.

As an administrator, you probably want to know what applications are being run on the system. XALT can help you gather this information. A prolog script that configures the XALT paths ensures this data is gathered. I highly recommend XALT, so adding the paths to your prolog script is a good idea.

For larger systems, and depending on the storage configuration, consider your storage policies. Do the policies restrict users to certain filesystems? Do you erase data on high-speed scratch filesystems that is older than some threshold? You might want a prolog script that creates specific directories for the user, an epilog script that copies data back from those directories to the user’s more permanent storage, or both.
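A hedged sketch of the copy-back side, assuming the /scratch/$SLURM_JOBID directory created by the prolog in Listing 2 and using the user’s home directory as the destination:

#!/bin/bash
# Hypothetical epilog fragment: copy results from node-local scratch back to
# the user's home directory, then remove the scratch directory
SRC=/scratch/$SLURM_JOBID
DEST=$(getent passwd "$SLURM_JOB_USER" | cut -d: -f6)/job_$SLURM_JOBID

if [ -d "$SRC" ]; then
    sudo -u "$SLURM_JOB_USER" mkdir -p "$DEST"
    sudo -u "$SLURM_JOB_USER" cp -a "$SRC"/. "$DEST"/
    rm -rf "$SRC"
fi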

If your systems are larger, or if the nodes are expensive and you are concerned about every single node, you might want to consider running a node health check in the epilog script. The NHC tool has been in use for a long time and provides some great information. Running a short health check such as NHC as part of an epilog can be particularly useful. In general, however, I highly recommend NHC even for small systems to check the state of the nodes.

Above all, no matter what you put in the scripts, be sure to test them thoroughly before putting them into production. Moreover, do not start using them on a Friday at the end of a day, or Monday will truly be an ugly day. Instead, try putting them into production on a Tuesday or Wednesday so users can exercise them, get used to them, and ask questions if need be. Also, this gives you time to correct any mistakes or problems. Finally, you should tell all users about changes to the prolog and epilog scripts and that their output from the resource manager will look different from earlier output. Give the users an example of the changes to the output so they will not be surprised.