pdsh Parallel Shell

The pdsh parallel shell tool lets you run a command across multiple nodes in a cluster.

When I was preparing for my qualification exams in graduate school, I talked with fellow grad students about their experiences. I remember one fellow student said something like, "When in doubt, focus on the fundamentals"; that is, go back to first principles to solve problems. I have always remembered his comment and I try to apply it when I can.

For HPC, one of the fundamentals is being able to run a command across a number of nodes in a cluster. A parallel shell is a simple but powerful tool that allows you to do so on designated (or all) nodes in the cluster, so you do not have to log in to each node and run the same command. This single tool has an infinite number of ways to be useful, but I like to use it when performing administrative tasks, such as:

  • discovering the status of the nodes in the cluster quickly,
  • checking the versions of particular software packages on each node,
  • checking the OS version on all nodes,
  • checking the kernel version on all nodes,
  • searching the system logs on each node (if you do not store them centrally),
  • examining the CPU usage on each node,
  • examining local I/O (if the nodes do any local I/O),
  • checking whether any nodes are swapping,
  • spot-monitoring the compute nodes, and
  • debugging.

This list is just the short version; the real list is extensive. Anything you want to do on a single node can be done on a large number of nodes using a parallel shell tool. However, for those that might be asking if they can use parallel shells on their 50,000-node clusters, the answer is that you can, but the time skew in the results will be large enough that the results might not be useful (which is a completely different subject). Parallel shells are more practical when used on a smaller number of nodes, on specific nodes (e.g., those associated with a specific job in a resource manager), or for gathering information that varies somewhat slowly. However, some techniques will allow you to run parallel commands on a large number of nodes.

Among the parallel shells available, many are written in Python, which has become a very popular DevOps tool. Some of the tools are perhaps not as appropriate or useful for HPC but may be good for other tasks. The shell I typically use – and that I have found a large number of other people using – is pdsh.

Introduction to pdsh

The pdsh tool is arguably one of the most popular parallel shells. It allows you to run commands on multiple nodes using only SSH, so the data transmission is encrypted. [It is a good practice to encrypt all data, whether it is “on the wire” or “at rest,” or within the cluster or from outside the cluster.] Only the client nodes need to have ssh installed, which is pretty typical for HPC systems. However, you need the ability to ssh to any node without a password (i.e., passwordless ssh). Using ssh inside the cluster should alleviate your fears about not using passwords.

Building and Installing pdsh

Building and installing pdsh is really simple if you have built code using GNU’s autoconfigure before:

./configure --with-ssh --without-rsh
make
make install

These three lines put the binaries into /usr/local/, which is fine for testing purposes. For production work, I would put them in /opt or the like; just be sure the directory is in your path. Also, to make life easier, I put the directory on a filesystem that is shared with the compute nodes, which allows pdsh to run regardless of what system you are using.

You might notice that I used the --without-rsh option in the configure command. By default, pdsh uses rsh, which is not secure and should never be used. Notice the available rcmd modules (rcmd is the “remote command” used by pdsh) at the bottom of Listing 1 states that only ssh and exec are available. If rsh wasn't excluded, it would be listed here, too, and it would be the default; however, it is highly recommended that you not build pdsh with rsh because it is such a security hole.

Listing 1: pdsh Options

$ pdsh -v
pdsh: invalid option -- 'v'
Usage: pdsh [-options] command ...
-S                return largest of remote command return values
-h                output usage menu and quit
-V                output version information and quit
-q                list the option settings and quit
-b                disable ^C status feature (batch mode)
-d                enable extra debug information from ^C status
-l user           execute remote commands as user
-t seconds        set connect timeout (default is 10 sec)
-u seconds        set command timeout (no default)
-f n              use fanout of n nodes
-w host,host,...  set target node list on command line
-x host,host,...  set node exclusion list on command line
-R name           set rcmd module to name
-M name,...       select one or more misc modules to initialize first
-N                disable hostname: labels on output lines
-L                list info on all loaded modules and exit
available rcmd modules: ssh,exec (default: ssh)

If you happened to build pdsh with rsh and do not or cannot rebuild it, you can override rsh and make ssh the default by adding the following line to your .bashrc file:

export PDSH_RCMD_TYPE=ssh

Be sure to source your .bashrc file (e.g., source .bashrc) to set the environment variable. You can also log out and log back in.

If for some reason you see the following when you try running pdsh, then you have built it with rsh:

$ pdsh -w 192.168.1.250 ls -s
pdsh@home4: 192.168.1.250: rcmd: socket: Permission denied

You can either rebuild pdsh without rsh or use the environment variable in your .bashrc file (or both).

First pdsh Commands

A quick test ensures that pdsh is working correctly. This simple test gets the kernel version of a different node using the IP address of the other node.

$ pdsh -w 192.168.1.250 uname -r
192.168.1.250: 2.6.32-431.11.2.el6.x86_64

The -w option means that the IP address of the target’s node(s) is specified. In this case, the IP address of the target remote node is listed (192.168.1.250). Specifically, uname -r is the command to be run. Finally, notice that pdsh output starts the with the node name 192.168.1.250 followed by the output of the command or, as in this case, an error message.

In the off chance you need to mix rcmd modules in a single command, you can specify which module to use on the pdsh command line. For example, the command below uses ssh:

$ pdsh -w ssh:laytonjb@192.168.1.250 uname -r
192.168.1.250: 2.6.32-431.11.2.el6.x86_64

You just put the specific rcmd module before the node name. In this case, ssh.

A very common use of pdsh is to use the WCOLL environment variable that points to the file with a list of hosts. If you don't specify a host list on the command line, pdsh will use the host list indicated by WCOLL. As an example, I created a pdsh subdirectory with a file named hosts that lists the default hosts:

$ mkdir PDSH
$ cd PDSH
$ vi hosts
$ more hosts
192.168.1.4
192.168.1.250

For this specific case, only two nodes are listed: 192.168.1.4 and 192.168.1.250. The first node is the cluster head node, and the second is the first test compute node. The hosts in the file can be specified in a comma-separated list, if desired. Just be sure not to put a blank line at the end of the file because pdsh will think it is a host and try to connect to it. You can put the WCOLL environment variable in the .bashrc file:

export WCOLL=/home/laytonjb/PDSH/hosts

As before, you can source your .bashrc file, or you can log out and log back in to activate it after you create it.

Ideally, your /home directory is shared with all nodes in the cluster, which means you only have to edit .bashrc once; otherwise, you have to log in to every node and update the local .bashrc. (What a pain!)

Specifying Hosts

Specifying a list of nodes can be done in several ways, which I will not list here. The pdsh site discusses virtually all of them, some of which are pretty handy.

The simplest way to specify the nodes is on the command line with the -w option, separated by commas:

$ pdsh -w 192.168.1.4,192.168.1.250 uname -r
192.168.1.4: 2.6.32-696.30.1.el6.x86_64
192.168.1.250: 2.6.32-431.11.2.el6.x86_64

Or, you can use a range of hosts on the command line:

$ pdsh -w host[1-11] uname -r
$ pdsh -w host[1-4,8-11] uname -r

In the first case, pdsh expands the host range to host1, host2, host3, etc., through host11. In the second case pdsh expands the host list to host1, host2, host3, host4, host8, host9, etc., through host11. The pdsh website has more information on hostlist expressions. Being able to specify a range of hosts is very convenient if you have a large number of nodes.

Even if you use a range of hosts, specifying the exact host targets can result in a long command. Another option is to have pdsh read the hosts from a file other than the WCOLL environment variable:

$ pdsh -w ^/tmp/hosts uptime
192.168.1.4:  15:51:39 up  8:35, 12 users,  load average: 0.64, 0.38, 0.20
192.168.1.250:  15:47:53 up 2 min,  0 users,  load average: 0.10, 0.10, 0.04
$ more /tmp/hosts
192.168.1.4
192.168.1.250

This command tells pdsh to take the hostnames from the file listed after -w ^ (no space between the ^ and the file name), which is /tmp/hosts here.

Using files that contain lists of nodes is a great way to examine subsets of nodes, especially for larger systems. For example, you could create a list of hosts in one file for the first rack, a second list in a different file for the second rack, and so on. A command could then be run that uses the specific file that lists the hosts in a specific rack.

The subsets do not have to be divided on a rack basis. The lists can be based on node type or cluster function, such as putting all I/O nodes or all login nodes into a single file; you are not limited as to what you can do with these files. This method allows you to use pdsh with very large systems by breaking it to subclusters that are more manageable; then, you can run the same command on each of the subclusters.

The following example shows how to use several host files at once:

$ more /tmp/hosts
192.168.1.4
$ more /tmp/hosts2
192.168.1.250
$ pdsh -w ^/tmp/hosts,^/tmp/hosts2 uname -r
192.168.1.4: 2.6.32-696.30.1.el6.x86_64
192.168.1.250: 2.6.32-431.11.2.el6.x86_64

Notice that both nodes respond to the command. Hosts can also be excluded from a list:

$ pdsh -w -192.168.1.250 uname -r
192.168.1.4: 2.6.32-696.30.1.el6.x86_64

The hyphen in front of 192.168.1.250 means it is excluded from the command, so the output is only from 192.168.1.4. Nodes listed in a file can also be excluded from the pdsh command:

$ pdsh -w -^/tmp/hosts2  uname -r
192.168.1.4: 2.6.32-696.30.1.el6.x86_64

By using the hyphen in front of the file name, the nodes listed in the file are excluded. In this case /tmp/hosts2, which contains 192.168.1.250, is not included in the output.

Alternatively, you can use the -x option combined with a hostname or a list of hostnames to be excluded:

$ pdsh -x 192.168.1.4 uname -r
192.168.1.250: 2.6.32-431.11.2.el6.x86_64
$ pdsh -x ^/tmp/hosts uname -r
192.168.1.250: 2.6.32-431.11.2.el6.x86_64
$ more /tmp/hosts
192.168.1.4

The first command excludes 192.168.1.4 from the default list of hosts. The second command excludes the host in /tmp/hosts from the output.

The combinations of these options are very powerful, allowing you to set up files with hosts for specific subsets of nodes, such as racks, or to create node lists on the basis of node function, such as login or storage. If nodes are down, you can exclude them, if you like. The point is, you should think about common node combinations and include them in specific files. Simple bash scripts could also allow you to specify combinations of node files.

More Useful pdsh Options

The previous examples are fairly simple. To better understand the breadth of options to be used with pdsh, I will show you some different commands. Listing 2 illustrates how to run a complex pdsh command. Before getting into the details, notice that each output line from pdsh lists the node followed by the response from the command.

Listing 2: Complex pdsh Command 1

$ pdsh `cat /proc/cpuinfo | grep bogomips`
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23

Notice that the entire command is in backquotes, meaning the entire command is run on each node. This includes the first part, cat /proc/cpuinfo, whose output is piped to the second part of the command, grep bogomips. Using the backquotes allows you to run complex commands on the target nodes.

For the particular command in Listing 2, the value of bogomips differs for each node because the nodes are different: The first node has eight cores (four cores and four Hyper-Threading cores), whereas the second node has four cores. Consequently, the bogomips value should be different for the two nodes.

Listing 3 is a variation of the command in Listing 2. Notice that the entire command is not contained in backquotes; rather, only the first part of the command is contained in single quotes. When you run this pdsh command, the part in the quotes is run first on all targeted nodes. After the command returns, the output is piped through the second part of the command, grep bogomips, which is executed on the node where pdsh was run. The point of these command variations is to show that you need to be careful how you construct a command so you understand the output.

Listing 3: Complex pdsh Command 2

$ pdsh 'cat /proc/cpuinfo' | grep bogomips
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.4: bogomips   : 6998.13
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23

A very important item to note is that pdsh does not guarantee that the output is returned in any certain order. If 20 nodes are targeted in the list, the output from pdsh will not necessarily start with node 1 and increase incrementally to node 20. Listing 4 is an example of a vmstat command run on two nodes. The command should run twice on each node in one-second intervals.

Listing 4: Commingled Output

$ pdsh vmstat 1 2
192.168.1.4:  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
192.168.1.4:   r  b swpd   free   buff  cache    si   so   bi    bo    in   cs us sy  id wa st
192.168.1.4:   1  0    0 30198704 286340 751652   0    0    2     3    48   66  1  0  98  0  0
192.168.1.250: procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
192.168.1.250:  r  b swpd   free   buff  cache    si   so   bi    bo    in   cs us sy  id wa st
192.168.1.250:  0  0    0 7248836  25632  79268    0    0   14     2    22   21  0  0  99  0  0
192.168.1.4:    1  0    0 30198100 286340 751668   0    0    0     0   412  735  1  0  99  0  0
192.168.1.250:  0  0    0 7249076  25632  79284    0    0    0     0    90   39  0  0 100  0  0

At first glance, it looks like the output is from the first node but then the output from the second node creeps in. A command with multiple output lines cannot guarantee the order of the output. If you really have to run a command across the target nodes with multiple lines of output, the only real choice is to put all of the output into a file and edit it to rearrange the lines so they are in the correct order.

A word of caution, though: If the command produces multiple output lines, for example three lines, it is possible the output lines from a single node will arrive out of order; for example line 3 could arrive before line 2. Ideally having a tag on each line of output would allow it to be reassembled much more easily.

Another technique for using pdsh is to run scripts on each node. For example, in previous articles on processor and memory metrics and process, network, and disk metrics, scripts were designed to create the metrics. With some simple modifications, these scripts can return a single line of output. Putting the scripts in a central location for each node allows pdsh to run the script on the target nodes.

pdsh Modules

Earlier I mentioned how pdsh uses rcmd modules by default to access nodes. The developers have extended this to create modules for various specific situations. The pdsh modules page lists other modules that can be built as part of pdsh, which currently includes:

  • rcmd/rsh
  • rcmd/ssh
  • rcmd/mrsh (uses munge authentication)
  • rcmd/xcpu
  • misc/genders (node selection using libgenders)
  • misc/nodeupdown (uses nodeupdown library)
  • misc/machines (provides an option for a flat-file list of hosts)
  • Slurm (list of targets built from SLURM_JOBID or -j jobid)
  • misc/dshgroup (list of targets built from dsh-style “group” files)
  • netgroup (list of targets to be built from netgroups)

These modules allow pdsh to do specific things. For example, the Slurm module allows you to run the command only on nodes specified by currently running Slurm jobs. When pdsh is run with the Slurm module, it will read the list of nodes from the SLURM_JOBID environment variable. You can also run pdsh with the -j jobid option, and it will get the list of hosts from the jobid specified.

Summary

A tool that allows you to run commands on a range of nodes is probably the most fundamental tool an HPC admin can use. Even for experienced admins, such an easy-to-use tool can help you understand quickly the state of your system. Arguably, the most popular parallel shell is pdsh. It is easy to use and flexible and has very useful modules to extend its capability.

The pdsh tool can be used on the cluster in a number of ways. An extremely common use is to check the load on all of the nodes in the cluster (uptime) to determine whether the node is up or down and report the load on the node. A myriad of other uses range from checking the version of software installed on the nodes, to spot monitoring, to installing packages.

The pdsh command lets you define a list of target hosts to include or exclude and allows you to treat clusters in subgroups when performing operations or to group hosts on the basis of function. Using modules, you can group target hosts by SLURM_JOBID, so you can query nodes that are part of a single job.

Finally, you can use pdsh in conjunction with scripts on a shared workspace and then use the command to run the scripts on target hosts. However, a word of caution: If possible, do not run commands or scripts that have multiline output you would have to reassemble into the proper order.

If you are starting out in the cluster world, or even if you are an experienced administrator, pdsh is a go-to tool for managing and monitoring systems.