It’s the Little Things

Several very sophisticated tools can be used to manage HPC systems, but it’s the little things that make them hum. Here are a few favorites.

The HPC world has some amazing “big” tools that help administrators monitor their systems and keep them running, such as the Ganglia and Nagios cluster monitoring systems. Although they are extremely useful, sometimes it is the smaller tools that can help debug a user problem or find system issues.

ldd

The introduction of sharable objects, or “dynamic libraries,” has allowed for smaller binaries, less “skew” across binaries, and a reduction in memory usage, among other things. Users, myself included, tend to forget that when code is compiled, we only see the size of the binary itself, not the “shared” objects.

For example, the following simple Hello World program, called test1, uses the PGI compilers (16.10). (Because all good HPC developers should be writing in Fortran, that’s what I use in this example.)

PROGRAM HELLOWORLD
write(*,*) "hello world"
END

Running the ldd command against the compiled program produces the output in Listing 1. The binary itself is very small, and you might think that is all there is to it. After looking at the list of libraries linked to it, though, you can begin to appreciate what compilers and linkers do for users today.

Listing 1: Show Linked Libraries

[laytonjb@laytonjb-Lenovo-G50-45 ~]$ pgf90 test1.f90 -o test1
[laytonjb@laytonjb-Lenovo-G50-45 ~]$ ldd test1
        linux-vdso.so.1 =>  (0x00007fff11dc8000)
        libpgf90rtl.so => /opt/pgi/linux86-64/16.10/lib/libpgf90rtl.so (0x00007f5bc6516000)
        libpgf90.so => /opt/pgi/linux86-64/16.10/lib/libpgf90.so (0x00007f5bc5f5f000)
        libpgf90_rpm1.so => /opt/pgi/linux86-64/16.10/lib/libpgf90_rpm1.so (0x00007f5bc5d5d000)
        libpgf902.so => /opt/pgi/linux86-64/16.10/lib/libpgf902.so (0x00007f5bc5b4a000)
        libpgftnrtl.so => /opt/pgi/linux86-64/16.10/lib/libpgftnrtl.so (0x00007f5bc5914000)
        libpgmp.so => /opt/pgi/linux86-64/16.10/lib/libpgmp.so (0x00007f5bc5694000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f5bc5467000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5bc524a000)
        libpgc.so => /opt/pgi/linux86-64/16.10/lib/libpgc.so (0x00007f5bc4fc2000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f5bc4dba000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f5bc4ab7000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f5bc46f4000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5bc44de000)
        /lib64/ld-linux-x86-64.so.2 (0x000056123e669000)

If the application fails, a good place to look is the list of libraries that are linked to the binary. If the paths have changed or if you copy the binary from one system to another, you might see issues with a library mismatch. The ldd command is indispensable when chasing down strange issues with libraries. “ldd – Don’t run an HPC system without it.”
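For example, a library that has been moved or is missing on the target system shows up right away in the ldd output (the library name below is just an illustration):

$ ldd test1 | grep "not found"
        libpgf90rtl.so => not found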

find

One of those *nix commands you don't learn at first but that can really save your bacon is find. If you are looking for a specific file or a set of files with a similar name, then find is your friend.

I tend to use find two ways. The first way is fairly straightforward: If I'm looking for a file or a set of files below the directory in which I’m located (pwd), then I use some variation of the following:

$ find . -name "*.dat.2"

The dot right after the command tells find to start searching from the current directory and then recursively search all directories beneath it.

The -name option allows you to specify a “template” that find uses to look for files. In this case, the template is *.dat.2, which means find will locate any file that ends in .dat.2 (note the use of the * wildcard). If I truly get desperate to find a file, I'll go to the root directory (/) and run the same find command.
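That desperate, system-wide search might look something like the following; redirecting standard error hides the “Permission denied” messages from directories you can’t read (the file pattern is just an illustration):

$ find / -name "*.dat.2" 2>/dev/null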

The second way I run find is to use its output as input to grep. Remember, *nix is designed around small programs that can be piped into one another to create something more complex. Here, the command chain

$ find . -name "*.dat.2" | grep -i "xv426"

takes the output from the find command and pipes it into the grep command to look for any file name that contains the string xv426, whether it be uppercase or lowercase (the -i option).

You can combine find with other commands, such as sort, uniq, wc, sed, or virtually any other *nix command. The power of *nix lies in this ability.
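For example, two quick pipelines I find handy (the file pattern is again just an illustration): the first counts the matching files, and the second lists the directories that contain them:

$ find . -name "*.dat.2" | wc -l
$ find . -name "*.dat.2" | sed 's|/[^/]*$||' | sort | uniq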

You can use the find command in other ways, or you can use different commands that accomplish the same thing (in *nix, there is more than one way to do it). Just remember that you have a large number of commands to draw on to achieve a goal. You don't have to write Python or Perl code to accomplish a task that you can accomplish with existing commands.

SSH and PDSH

It might sound rather obvious, but two of the tools I rely on the most are a simple secure tool for remote logins, ssh, and a tool that uses ssh to run commands on remote systems in parallel or in groups, pdsh. When clusters began, the tool of choice for remote logins was rsh. It had been around for a while and was very easy to use. However, it was very insecure.

It transmitted data from the host machine to the target machine with no encryption – including passwords. Therefore, a very simple network sniff could gather lots of passwords. Using rsh, rlogin, or telnet between systems across the Internet, such as when users logged into the cluster, left you wide open to attacks and password sniffing. Very quickly people realized that something more secure was needed.

In 1995, researcher Tatu Ylönen created Secure Shell (SSH) in response to a password-sniffing attack on his network. It gained popularity, and SSH Communications Security was founded to commercialize it. The OpenBSD community took the last open version of SSH and developed it into OpenSSH. After SSH gained popularity in the early 2000s, the cluster community adopted it to help secure clusters.

SSH is extremely powerful. Beyond just remote logins and running commands on a remote node, it can be used for tunneling or forwarding other protocols over SSH. Specifically, you can use SSH to forward X from a remote host to your desktop and copy data from one host to another (scp), and you can use it in combination with rsync to back up, mirror, or copy data.
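A few examples of what this looks like in practice (hostnames and paths are placeholders):

$ ssh -X user@remote-host                                    # remote login with X forwarding
$ scp results.tar.gz user@remote-host:/data/                 # secure copy to a remote host
$ rsync -av -e ssh /home/user/project/ user@remote-host:/backup/project/   # mirror a directory over SSH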

SSH was a great development for clusters and HPC, because there was finally a way to log into systems and send commands securely and remotely. However, SSH can only do this for a single system at a time, and HPC systems can comprise hundreds or thousands of nodes, so administrators needed a way to send the same command to a number of nodes or to a fixed set of nodes.

In a past article, I wrote about a class of tools that accomplishes this goal. These tools are parallel shells, and a number of them meet different needs. The most common is pdsh. In theory, it is fairly simple: It uses a specified remote command to run the same command on the specified nodes. You have a choice of underlying tools to use when you build and use pdsh. I prefer to use SSH because of the security it offers.
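If your pdsh build includes more than one remote command module, you can usually select SSH at run time rather than rebuilding; for example (assuming the ssh module was compiled in):

export PDSH_RCMD_TYPE=ssh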

To tell pdsh which hosts to operate on by default, create a simple file containing the list of hosts and point the WCOLL environment variable at it:

export WCOLL=/home/laytonjb/PDSH/hosts

WCOLL is an environment variable that points to the location of the file with the list of hosts. You can put this command in your .bashrc file in your home directory, or you can put it in the global .bashrc file.
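With the host file in place, running a command across all of those nodes is a one-liner. As a sketch (node names are placeholders), you can also target a specific set of nodes with -w or collate identical output with dshbak, which ships with pdsh:

$ pdsh uptime
$ pdsh -w node0[1-4] uptime | dshbak -c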

SSHFS

I've written about SSHFS in the past. It has to be one of the most awesome filesystem tools I have ever used. It is a FUSE-based userspace client that mounts and interacts with a remote filesystem as though the filesystem were local (i.e., shared storage). It uses SSH as the underlying protocol and SFTP as the transfer protocol, so it’s as secure as SFTP.

SSHFS can be very handy when working with remote filesystems, especially if you only have SSH access to the remote system. Moreover, you don’t need to add or run a special client tool on the client nodes or a special server tool on the storage node. You just need SSH active on your system. Almost all firewalls allow port 22 access, so you don’t have to configure anything extra (e.g., NFS or CIFS); you just need one open port on the firewall – port 22. All the other ports can be blocked.
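As a minimal sketch (hostnames and paths are placeholders), mounting a remote home directory and unmounting it again looks something like this:

$ mkdir -p ~/remote_home
$ sshfs user@remote-host:/home/user ~/remote_home
$ fusermount -u ~/remote_home        # unmount when finished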

Many filesystems offer encryption of data at rest. Using SSHFS in combination with an encrypted filesystem ensures that your data is encrypted both at rest and “over the wires,” which prevents packet sniffing of data within or outside the cluster. That is an important consideration in our current mobile society, where users want to access their data from multiple places with multiple devices.

A quick glance at SSHFS performance indicates that sequential read and write performance is on par with NFS; however, random I/O performance is less efficient than NFS. Fortunately, you can tune SSHFS to reduce the effect of encryption on performance, and you can enable compression to improve performance further. With these tuning options, you can recover SSHFS performance so that it matches, and even exceeds, NFS performance.
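The exact options depend on your sshfs, FUSE, and OpenSSH versions, but tuning usually means passing options through to ssh or to FUSE, along the lines of the following sketch (the cipher, compression, and cache settings shown here are examples, not a benchmarked recipe):

$ sshfs user@remote-host:/home/user ~/remote_home \
      -o Ciphers=aes128-ctr -o Compression=yes -o big_writes -o kernel_cache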

vmstat

One of those *nix commands that gets no respect is vmstat, yet it can be an extremely useful command, particularly for HPC. The vmstat command reports Linux virtual memory statistics. Although it has several “modes,” I find the default mode to be extremely useful. Listing 2 is a quick snapshot of a Linux laptop.

Listing 2: vmstat on a Laptop

[laytonjb@laytonjb-Lenovo-G50-45 ~]$ vmstat 1 5
procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 5279852   2256 668972    0    0  1724    25  965 1042 17  9 71  2  0
 1  0      0 5269008   2256 669004    0    0     0     0 2667 1679 28  3 69  0  0
 1  0      0 5260976   2256 669004    0    0     0   504 1916  933 25  1 74  0  0
 2  0      0 5266288   2256 668980    0    0     0    36 4523 2941 29  4 67  0  0
 0  0      0 5276056   2256 668960    0    0     4     4 9104 6262 36  5 58  0  0

Each line of output corresponds to a system snapshot at a particular time (Table 1), and you can control the amount of time between snapshots. The first line of numbers reports the metrics since the system was last rebooted; the lines after that report current values. A number of system metrics are very important. The first thing to look at is the number of processes (r and b). If these numbers start moving up, something unusual might be happening on the node, such as processes waiting for run time or sleeping.

Table 1: vmstat Output

vmstat Column Meaning
procs  
  r No. of processes waiting for run time
  b No. of processes in uninterruptible sleep
memory  
  swpd Amount of virtual memory used
  free Amount of idle memory
  buff Amount of memory used as buffers
  cache Amount of memory used as cache
swap  
  si Amount of memory swapped in from disk (blocks/sec)
  so Amount of memory swapped out to disk (blocks/sec)
io  
  bi No. of blocks received from a block device (blocks/sec)
  bo No. of blocks sent to a block device (blocks/sec)
system  
  in No. of interrupts per second, including the clock
  cs No. of context switches per second
cpu  
  us Time spent running non-kernel code (=user time + nice time)
  sy Time spent running kernel code (=system time)
  id Time spent idle
  wa Time spent waiting for I/O
  st Time stolen from a virtual machine

The metrics listed under memory can be useful, particularly as the kernel grabs and releases memory. You shouldn't be too worried about these values unless the values in the next section (swap) are non-zero. If you see non-zero si and so values, excluding the first row, you should be concerned, because it indicates that the system is swapping, and swapping memory to disk can really kill performance. If a user is complaining about performance and you see a node running really slowly with a very large load, there is a good possibility that the node is swapping.

The metrics listed in the io section are also good to watch. They list the number of blocks received from and sent to block devices. If both of these numbers are large, the application running on the node is likely doing something unusual by reading and writing to the device at the same time. This situation, too, can hurt performance.

The other metrics can be very useful, but I tend to focus on those mentioned first before scanning the others. You can also send this data to a file for later postprocessing or plotting – for example, for debugging user problems on nodes.
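For example, to capture a sample every second for an hour on a node and examine it later (the output filename is arbitrary), you might run something like:

$ vmstat -n 1 3600 > /tmp/vmstat_$(hostname).log &

The -n flag prints the header only once, which makes the log easier to postprocess.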
