Lead Image © Vasyl Nesterov, 123RF.com

Command-line tools for the HPC administrator

Line Items

Article from ADMIN 42/2017
Several sophisticated command-line tools can help you manage and troubleshoot HPC (or other) systems.

The HPC world has some amazing "big" tools that help administrators monitor their systems and keep them running, such as the Ganglia and Nagios cluster monitoring systems. Although they are extremely useful, sometimes it is the small tools that can help debug a user problem or find system issues. Here are a few favorites.


ldd

The introduction of shared objects [1], or "dynamic libraries," has allowed for smaller binaries, less "skew" across binaries, and a reduction in memory usage, among other things. Users, myself included, tend to forget that when code compiles, we only see the size of the binary itself, not the "shared" objects it depends on.

For example, the following simple Hello World program, test1, was built with the PGI compilers (16.10):

write(*,*) "hello world"
end

Running the ldd command against the compiled program produces the output in Listing 1. If you look at the binary, which is very small, you might think it is the complete story, but after looking at the list of libraries linked to it, you can begin to appreciate what compilers and linkers do for users today.

Listing 1

Show Linked Libraries (ldd)

$ pgf90 test1.f90 -o test1
$ ldd test1
    linux-vdso.so.1 =>  (0x00007fff11dc8000)
    libpgf90rtl.so => /opt/pgi/linux86-64/16.10/lib/libpgf90rtl.so (0x00007f5bc6516000)
    libpgf90.so => /opt/pgi/linux86-64/16.10/lib/libpgf90.so (0x00007f5bc5f5f000)
    libpgf90_rpm1.so => /opt/pgi/linux86-64/16.10/lib/libpgf90_rpm1.so (0x00007f5bc5d5d000)
    libpgf902.so => /opt/pgi/linux86-64/16.10/lib/libpgf902.so (0x00007f5bc5b4a000)
    libpgftnrtl.so => /opt/pgi/linux86-64/16.10/lib/libpgftnrtl.so (0x00007f5bc5914000)
    libpgmp.so => /opt/pgi/linux86-64/16.10/lib/libpgmp.so (0x00007f5bc5694000)
    libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f5bc5467000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5bc524a000)
    libpgc.so => /opt/pgi/linux86-64/16.10/lib/libpgc.so (0x00007f5bc4fc2000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f5bc4dba000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f5bc4ab7000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f5bc46f4000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5bc44de000)
    /lib64/ld-linux-x86-64.so.2 (0x000056123e669000)

If the application fails, a good place to look is the list of libraries linked to the binary. If the paths have changed or if you copy the binary from one system to another, you might see a library mismatch. The ldd command is indispensable when chasing down strange issues with libraries.
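As a minimal sketch of that check, the function below filters ldd output for libraries the dynamic linker could not resolve. The name missing_libs is hypothetical, and the sketch assumes the usual glibc ldd format, in which an unresolved library is reported as "name => not found".

```shell
#!/bin/sh
# missing_libs (hypothetical helper): print the names of shared libraries
# that ldd reports as unresolved. Reads ldd-style output on stdin.
missing_libs() {
    # Unresolved entries look like: "libfoo.so => not found"
    awk '/=> not found/ { print $1 }'
}

# Usage (test1 is the example binary from Listing 1):
#   ldd ./test1 | missing_libs
```

An empty result means every library was resolved on this system; any name printed is a good starting point for checking LD_LIBRARY_PATH or the install location of the compiler runtime.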


find

One of those *nix commands you don't learn at first but that can really save your bacon is find. If you are looking for a specific file or a set of files with a similar name, then find is your friend.

I tend to use find two ways. The first way is fairly straightforward: If I'm looking for a file or a set of files below the directory in which I'm located (pwd), then I use some variation of the following:

$ find . -name "*.dat.2"

The dot right after the find command tells it to start searching from the current directory and then search all directories under that directory.

The -name option lets me specify a "template" that find uses to look for files. In this case, the template is *.dat.2. Because of the * wildcard, find locates any file whose name ends in .dat.2. If I truly get desperate to find a file, I can go to the root directory (/) and run the same find command.
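A search from the root directory tends to produce many "Permission denied" messages and can wander into other mounted filesystems. The sketch below is one way to tame that, not the only one: find_quiet is a hypothetical wrapper name, -xdev keeps find on the filesystem where the search starts, and errors are simply discarded.

```shell
#!/bin/sh
# find_quiet START PATTERN (hypothetical helper): search below START for
# names matching PATTERN, staying on one filesystem (-xdev) and discarding
# error messages such as "Permission denied".
find_quiet() {
    find "$1" -xdev -name "$2" 2>/dev/null
}

# Usage: quote the pattern so the shell passes it to find unexpanded.
#   find_quiet / "*.dat.2"
```

Drop -xdev if the file might live on a separately mounted filesystem, such as a shared scratch area.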

The second way I run find is to use its output as input to grep. Remember, *nix is designed to have small programs that can pipe input and output from one command to another for complex processing. Here, the command chain

$ find . -name "*.dat.2" | grep -i "xv426"

takes the output from the find command and pipes it into the grep command to look for any result whose path contains the string xv426, regardless of case (-i). The power of *nix lies in this ability to combine find with virtually any other *nix command (e.g., sort, uniq, wc, sed).
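To illustrate a longer chain, here is a sketch that counts matching files per directory by combining find with sed, sort, and uniq. The function name matches_per_dir is hypothetical, and the sketch assumes file names do not contain newlines.

```shell
#!/bin/sh
# matches_per_dir START PATTERN (hypothetical helper): count files matching
# PATTERN in each directory below START, busiest directory first.
matches_per_dir() {
    find "$1" -type f -name "$2" |
        sed 's|/[^/]*$||' |   # strip the file name, leaving the directory
        sort |                # group identical directories together
        uniq -c |             # count each group
        sort -rn              # largest count first
}

# Usage:
#   matches_per_dir . "*.dat.2"
```

Each output line is a count followed by a directory, which makes it easy to spot where a run dumped most of its files.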

In *nix, you have more than one way to accomplish a task: Different commands can yield the same end result. Just remember that you have a large number of commands from which to draw; you don't have to write Python or Perl code to accomplish a task that you can accomplish from the command line.

ssh and pdsh

It might sound rather obvious, but two of the tools I rely on the most are a simple secure tool for remote logins, ssh, and a tool that uses ssh to run commands on remote systems in parallel or in groups, pdsh. When clusters first showed up, the tool of choice for remote logins was rsh. It had been around for a while and was very easy to use.

However, it was very insecure because it transmitted data, including passwords, from the host machine to the target machine with no encryption. Therefore, a very simple network sniff could gather lots of passwords. Anyone who used rsh, rlogin, or telnet between systems across the Internet, such as users logging in to a cluster, was wide open to attacks and password sniffing. People quickly realized that something more secure was needed.

In 1995, researcher Tatu Ylönen created Secure Shell (SSH) because of a password sniffing attack on his network. It gained popularity, and SSH Communications Security was founded to commercialize it. The OpenBSD community grabbed the last open version of SSH and developed it into OpenSSH [2]. After it gained popularity in the early 2000s, the cluster community adopted it to help secure clusters.

SSH is extremely powerful. Beyond just remote logins and commands run on a remote node, it can be used for tunneling or forwarding other protocols over SSH. Specifically, you can use SSH to forward X from a remote host to your desktop and copy data from one host to another (scp), and you can use it in combination with rsync to back up, mirror, or copy data.
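The three uses just mentioned can be sketched as small wrapper functions. Everything named here is an assumption for illustration: the function names, the host node01, and the paths are hypothetical, and the DRY_RUN switch is only a convenience that prints each command instead of running it.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# remote_x HOST PROGRAM: start an X program on HOST, displayed locally (ssh -X).
remote_x() { run ssh -X "$1" "$2"; }

# copy_to FILE HOST DEST: copy a single file to HOST with scp.
copy_to() { run scp "$1" "$2:$3"; }

# mirror_to DIR HOST DEST: mirror a directory to HOST with rsync over ssh.
mirror_to() { run rsync -av -e ssh "$1" "$2:$3"; }

# Usage (node01 and the paths are placeholders):
#   remote_x node01 xterm
#   copy_to results.tar node01 /tmp/
#   mirror_to data/ node01 /scratch/data/
```

Setting DRY_RUN=1 first is a cheap way to confirm what would be sent to a remote node before anything is actually transferred.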

SSH was a great development for clusters and HPC, because it finally provided a way to log in to systems and send commands securely and remotely. However, SSH could only do this for a single system, and HPC systems can comprise hundreds or thousands of nodes; therefore, admins needed a way to send the same command to a number of nodes or a fixed set of nodes.

In a past article [3], I wrote about a class of tools that accomplishes this goal: parallel shells. The most common is pdsh [4]. In theory, it is fairly simple to use. It relies on an underlying remote command (e.g., rsh or ssh) to run the same command on a specified set of nodes. You have a choice of underlying tools when you build and use pdsh. I prefer to use SSH because of the security it offers.

To tell pdsh which file contains the list of hosts you want it to use by default, enter:

$ export WCOLL=/home/laytonjb/PDSH/hosts

WCOLL is an environment variable that points to the location of the file that lists hosts. You can put this command in either your personal or the global .bashrc file.
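The hosts file itself is plain text with one host name per line. As a rough sketch, the helper below writes such a file; the name make_hosts_file and the node names are hypothetical, and the commented commands assume pdsh was built with its ssh module.

```shell
#!/bin/sh
# make_hosts_file FILE HOST... (hypothetical helper): write one host name
# per line to FILE, creating the parent directory if needed.
make_hosts_file() {
    file=$1
    shift
    mkdir -p "$(dirname "$file")"
    printf '%s\n' "$@" > "$file"
}

# Usage (node names are placeholders; nothing is written until you run this):
#   make_hosts_file "$HOME/PDSH/hosts" node01 node02 node03 node04
#   export WCOLL="$HOME/PDSH/hosts"
#   export PDSH_RCMD_TYPE=ssh      # use ssh as the remote command
#   pdsh uname -r | dshbak -c      # dshbak -c groups identical output
```

With WCOLL set, a plain pdsh command targets every host in the file, so no host list is needed on the command line.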
