Several very sophisticated tools can be used to manage HPC systems, but it’s the little things that make them hum. Here are a few favorites.

 

It’s the Little Things

The HPC world has some amazing “big” tools that help administrators monitor their systems and keep them running, such as the Ganglia and Nagios cluster monitoring systems. Although they are extremely useful, sometimes it is the smaller tools that can help debug a user problem or find system issues.

ldd

The introduction of shared objects, or “dynamic libraries,” has allowed for smaller binaries, less “skew” across binaries, and a reduction in memory usage, among other things. Users, myself included, tend to forget that when code is compiled, we only see the size of the binary itself, not the “shared” objects it depends on.

For example, the following simple Hello World program, called test1, is compiled with the PGI compilers (16.10). (Because all good HPC developers should be writing in Fortran, that’s what I use in this example.)

PROGRAM HELLOWORLD
write(*,*) "hello world"
END

Running the ldd command against the compiled program produces the output in Listing 1. If you look only at the binary, which is very small, you might think it is the complete program. After looking at the list of libraries linked to it, though, you can begin to appreciate what compilers and linkers do for users today.

Listing 1: Show Linked Libraries

[laytonjb@laytonjb-Lenovo-G50-45 ~]$ pgf90 test1.f90 -o test1
[laytonjb@laytonjb-Lenovo-G50-45 ~]$ ldd test1
        linux-vdso.so.1 =>  (0x00007fff11dc8000)
        libpgf90rtl.so => /opt/pgi/linux86-64/16.10/lib/libpgf90rtl.so (0x00007f5bc6516000)
        libpgf90.so => /opt/pgi/linux86-64/16.10/lib/libpgf90.so (0x00007f5bc5f5f000)
        libpgf90_rpm1.so => /opt/pgi/linux86-64/16.10/lib/libpgf90_rpm1.so (0x00007f5bc5d5d000)
        libpgf902.so => /opt/pgi/linux86-64/16.10/lib/libpgf902.so (0x00007f5bc5b4a000)
        libpgftnrtl.so => /opt/pgi/linux86-64/16.10/lib/libpgftnrtl.so (0x00007f5bc5914000)
        libpgmp.so => /opt/pgi/linux86-64/16.10/lib/libpgmp.so (0x00007f5bc5694000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f5bc5467000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5bc524a000)
        libpgc.so => /opt/pgi/linux86-64/16.10/lib/libpgc.so (0x00007f5bc4fc2000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f5bc4dba000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f5bc4ab7000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f5bc46f4000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5bc44de000)
        /lib64/ld-linux-x86-64.so.2 (0x000056123e669000)

If the application fails, a good place to look is the list of libraries linked to the binary. If the library paths have changed or if you copy the binary from one system to another, you might see issues with a missing or mismatched library. The ldd command is indispensable when chasing down strange issues with libraries. “ldd – Don’t run an HPC system without it.”
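As a minimal sketch of that kind of check, the following shell function filters the ldd output down to the libraries the dynamic linker cannot resolve. The function name missing_libs is made up for this example, and the sketch assumes a glibc-style ldd, which reports an unresolved library with the words "not found".

```shell
#!/bin/sh
# missing_libs is a hypothetical helper name, not a standard command.
# It prints only the ldd lines for libraries that cannot be resolved.
missing_libs() {
    # A glibc-style ldd prints "libfoo.so => not found" for each
    # dependency the dynamic linker is unable to locate.
    ldd "$1" 2>/dev/null | grep "not found"
}

# Example (no output means nothing is missing):
# missing_libs ./test1
```

If a binary copied from another system prints any lines here, the usual suspects are a library that is not installed or a library directory that is not in the linker's search path.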

find

One of those *nix commands you don't learn at first but that can really save your bacon is find. If you are looking for a specific file or a set of files with a similar name, then find is your friend.

I tend to use find two ways. The first way is fairly straightforward: If I'm looking for a file or a set of files below the directory in which I’m located (pwd), then I use some variation of the following:

$ find . -name "*.dat.2"

The dot right after the command tells find to start searching from the current directory, then search all directories under the current directory.

The -name option allows you to specify a “template” (a shell-style wildcard pattern) that find uses to look for files. In this case, the template is *.dat.2, which means find will locate any file whose name ends in .dat.2 (note the use of the * wildcard). If I truly get desperate to find a file, I'll go to the root directory (/) and run the same find command.
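To try this without touching real data, here is a small sketch that builds a throwaway directory tree and runs the same search. The directory and file names are invented for illustration. The -type f test, which limits matches to regular files, is an addition to the command shown above. The quotes around the pattern matter: without them, the shell could expand the wildcard before find ever sees it.

```shell
#!/bin/sh
# Build a scratch directory tree; all names here are invented for the demo.
demo=$(mktemp -d)
mkdir -p "$demo/run1/old"
touch "$demo/a.dat.2" "$demo/run1/b.dat.2" "$demo/run1/old/c.dat.2" "$demo/run1/b.dat.3"

# Quote the pattern so that find, not the shell, expands the wildcard.
# -type f restricts the matches to regular files.
matches=$(cd "$demo" && find . -type f -name "*.dat.2" | sort)
echo "$matches"

# Clean up the scratch directory.
rm -rf "$demo"
```

The search descends into run1 and run1/old, so all three .dat.2 files are reported, and b.dat.3 is skipped.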

The second way I run find is to use its output as input to grep. Remember, *nix is designed to have small programs that can pipe from one command into another to create something complex. Here, the command chain

$ find . -name "*.dat.2" | grep -i "xv426"

takes the output from the find command and pipes it into the grep command to look for any file name that contains the string xv426, whether it be uppercase or lowercase (the -i option).
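The following sketch runs that pipeline against a few invented file names and compares it with a variation that lets find do the case-insensitive match itself. The -iname test is not part of the original command; it is available in GNU and BSD find.

```shell
#!/bin/sh
# Scratch directory with invented file names for the demo.
demo=$(mktemp -d)
touch "$demo/XV426_run.dat.2" "$demo/xv426_test.dat.2" "$demo/other.dat.2"

# grep -i filters the paths printed by find, ignoring case.
piped=$(cd "$demo" && find . -name "*.dat.2" | grep -i "xv426" | sort)

# Variation: let find match the file name case-insensitively.
direct=$(cd "$demo" && find . -iname "*xv426*.dat.2" | sort)

echo "$piped"

# Clean up the scratch directory.
rm -rf "$demo"
```

The two forms agree here, but they are not identical in general: grep sees the whole path, so it also matches when a directory name contains the string, whereas -iname tests only the file name itself.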

You can combine find with other commands, such as sort, uniq, wc, sed, or virtually any other *nix command. The power of *nix lies in this ability.
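As one illustration of such a combination, the sketch below counts files and tallies them by extension using find with wc, sed, sort, and uniq. The file names are invented, and the pipeline is just one of many ways to do this.

```shell
#!/bin/sh
# Scratch directory with invented file names for the demo.
demo=$(mktemp -d)
touch "$demo/a.dat" "$demo/b.dat" "$demo/c.log"

# Count every regular file below the directory.
total=$(cd "$demo" && find . -type f | wc -l)

# Tally files by extension: sed strips everything up to the last dot,
# sort groups identical extensions, uniq -c counts each group, and
# the final sort -rn puts the most common extension first.
by_ext=$(cd "$demo" && find . -type f -name "*.*" | sed 's/.*\.//' | sort | uniq -c | sort -rn)
echo "$by_ext"

# Clean up the scratch directory.
rm -rf "$demo"
```

Each stage does one small job, and the pipe does the rest, which is exactly the design the paragraph above describes.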

You can use the find command in other ways, or you can use different commands that accomplish the same thing (in *nix, there is more than one way to do it). Just remember you have a large number of commands from which to draw that can be used to achieve a goal. You don't have to write Python or Perl code to accomplish a task that you can accomplish with existing commands.