pdsh Parallel Shell

Summary

A tool that lets you run commands on a range of nodes is probably the most fundamental tool an HPC admin can use. Even for experienced admins, such a tool helps you quickly understand the state of your system. Arguably, the most popular parallel shell is pdsh. It is flexible and easy to use, and its modules extend its capabilities in useful ways.

The pdsh tool can be used on a cluster in a number of ways. An extremely common use is to run uptime on every node, which shows at a glance which nodes are up and what load each one carries. Other uses range from checking the version of software installed on the nodes, to spot monitoring, to installing packages.
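The uptime check described above might be sketched as follows. This is an illustration rather than a recommended script: the function names parse_load and cluster_load are made up for this example, and the parser assumes the usual "load average:" wording in uptime output, which can vary by platform and locale.

```shell
#!/bin/sh
# parse_load: read pdsh-style lines ("host: <uptime output>") on stdin and
# print "host one_minute_load" for each line that contains a load average.
# (Hypothetical helper for this sketch, not part of pdsh.)
parse_load() {
    awk '{
        sub(/:$/, "", $1)                  # strip the colon pdsh appends to the host name
        for (i = 1; i < NF; i++) {
            if ($i ~ /^averages?:$/) {     # "load average:" or "load averages:"
                gsub(/,/, "", $(i + 1))    # drop the trailing comma after the value
                print $1, $(i + 1)
            }
        }
    }'
}

# cluster_load: run uptime on the given hosts and summarize the result.
# -w selects the target hosts; sort orders the summary by host name.
cluster_load() {
    pdsh -w "$1" uptime | parse_load | sort
}
```

Calling cluster_load with a host range such as 'node[01-04]' (a placeholder name) would print one line per responding node. Nodes that do not answer produce error messages on stderr and are simply missing from the summary.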

The pdsh command lets you define a list of target hosts to include or exclude, so you can treat a cluster as a set of subgroups when performing operations or group hosts on the basis of function. With the appropriate module, you can also select target hosts by SLURM_JOBID, so you can query just the nodes that are part of a single job.
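As a rough sketch of these host-selection options: -w names the target hosts, -x excludes hosts, -g selects a group (which requires a group module such as genders or dshgroup), and -j selects the nodes of a Slurm job (which requires pdsh to be built with the slurm module). The host names, the group name, the fallback job ID, and the DRY_RUN switch are all assumptions made for this example. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# DRY_RUN is a convention of this sketch, not a pdsh feature.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Kernel version on a host range ($1), skipping excluded hosts ($2).
kernel_versions() {
    run pdsh -w "$1" -x "$2" uname -r
}

# Load on every host in a named group ($1); needs a group module.
group_uptime() {
    run pdsh -g "$1" uptime
}

# Host names of the nodes allocated to a Slurm job ($1); needs the slurm module.
job_hosts() {
    run pdsh -j "$1" hostname
}

# Example calls with placeholder names.
kernel_versions 'node[01-08]' node05
group_uptime compute
job_hosts "${SLURM_JOBID:-12345}"   # 12345 is a placeholder job ID
```

Setting DRY_RUN=0 in the environment runs the commands for real, which only makes sense on a machine where pdsh and the relevant modules are installed.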

Finally, you can keep scripts on a shared workspace and use pdsh to run them on the target hosts. However, a word of caution: If possible, do not run commands or scripts that produce multiline output. Lines from different hosts can arrive interleaved, and you would have to reassemble them into the proper order.
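When multiline output cannot be avoided, one way to regroup it is sketched below. The dshbak utility that ships with pdsh collects output by host, and its -c option merges hosts with identical output. The fallback function group_by_host is a hypothetical helper that relies on the stable-sort flag -s found in GNU and BSD sort, and on each host's lines arriving in their original order. The script path and host range in the usage note are placeholders.

```shell
#!/bin/sh
# group_by_host: reorder "host: line" output so that all lines from one host
# are adjacent. The stable sort (-s) on the host field keeps each host's
# lines in the order they arrived.
group_by_host() {
    sort -s -t: -k1,1
}

# run_shared_script: run a script ($2) from the shared workspace on hosts ($1)
# and regroup the output, preferring dshbak when it is installed.
run_shared_script() {
    if command -v dshbak >/dev/null 2>&1; then
        pdsh -w "$1" "$2" | dshbak -c
    else
        pdsh -w "$1" "$2" | group_by_host
    fi
}
```

A call might look like run_shared_script 'node[01-04]' /shared/scripts/check_disk.sh, where both arguments stand in for names that exist on your cluster.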

If you are starting out in the cluster world, or even if you are an experienced administrator, pdsh is a go-to tool for managing and monitoring systems.