Command-line tools for the HPC administrator

Line Items

SSHFS

I've written about SSHFS [5] in the past, and it has to be one of the most awesome filesystem tools I have ever used. SSHFS is a Filesystem in Userspace (FUSE)-based [6] client that mounts and interacts with a remote filesystem as though the filesystem were local (i.e., shared) storage. It uses SSH as the underlying protocol and SFTP [7] as the transfer protocol, so it's as secure as SFTP.

SSHFS can be very handy when working with remote filesystems, especially if you only have SSH access to the remote system. Moreover, you don't need to add or run a special client tool on the client nodes or a special server tool on the storage node; SSH just needs to be active on both systems. Almost all firewalls allow access on port 22, so you don't have to configure anything extra, as you would for NFS or CIFS; only one port on the firewall, port 22, needs to be open, and all other ports can be blocked.
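
As a minimal sketch, mounting and unmounting a remote directory looks something like the following (the user name, host, and paths are hypothetical):

$ mkdir -p ~/remote_data                                   # local mount point
$ sshfs laytonjb@storage-node:/data/laytonjb ~/remote_data
$ df -h ~/remote_data                                      # verify the mount
$ fusermount -u ~/remote_data                              # unmount when finished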

Many filesystems allow encryption of data at rest. Using SSHFS in combination with an encrypted filesystem ensures that your data is encrypted at rest and "over the wires," which prevents packet sniffing within or outside the cluster and is an important consideration in a mobile society in which users want to access their data from multiple places with multiple devices.

A quick glance at SSHFS [8] indicates that sequential read and write performance is on par with NFS; random I/O performance, however, lags behind NFS. Fortunately, you can tune SSHFS to reduce the effect of encryption, and you can enable compression to speed transfers further. With these tuning options, SSHFS performance can match and even exceed that of NFS.
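
What that looks like depends on your OpenSSH and SSHFS versions, but as an illustrative sketch, you might select a faster cipher and enable compression at mount time (sshfs passes unrecognized -o options through to ssh; the cipher shown is an assumption, not a recommendation):

$ sshfs -o Ciphers=aes128-ctr -o Compression=yes \
        laytonjb@storage-node:/data/laytonjb ~/remote_data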

vmstat

One *nix command that gets no respect is vmstat [9]; however, it can be an extremely useful command, particularly in HPC. vmstat reports Linux system virtual memory statistics, and although it has several "modes," I find the default mode to be extremely useful. Listing 2 shows a quick snapshot taken on a Linux laptop.

Listing 2

vmstat on a Laptop

[laytonjb@laytonjb-Lenovo-G50-45 ~]$ vmstat 1 5
procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 5279852   2256 668972    0    0  1724    25  965 1042 17  9 71  2  0
 1  0      0 5269008   2256 669004    0    0     0     0 2667 1679 28  3 69  0  0
 1  0      0 5260976   2256 669004    0    0     0   504 1916  933 25  1 74  0  0
 2  0      0 5266288   2256 668980    0    0     0    36 4523 2941 29  4 67  0  0
 0  0      0 5276056   2256 668960    0    0     4     4 9104 6262 36  5 58  0  0

Each line of output corresponds to a system snapshot at a particular time (Table 1), and you can control the amount of time between snapshots. The first line of numbers reports averages since the system was last rebooted; the lines after that report current values. A number of system metrics are very important. The first thing to look at is the number of processes (r and b). If these numbers start moving up, something unusual might be happening on the node, such as processes piling up waiting for run time or blocked in uninterruptible sleep.

Table 1

vmstat Output

Column  Meaning
procs
       r     No. of runnable processes (running or waiting for run time)
       b     No. of processes in uninterruptible sleep
memory
       swpd  Amount of virtual memory used
       free  Amount of idle memory
       buff  Amount of memory used as buffers
       cache Amount of memory used as cache
swap
       si    Amount of memory swapped in from disk (blocks/sec)
       so    Amount of memory swapped out to disk (blocks/sec)
io
       bi    No. of blocks received from a block device (blocks/sec)
       bo    No. of blocks sent to a block device (blocks/sec)
system
       in    No. of interrupts per second, including the clock
       cs    No. of context switches per second
cpu
       us    Time spent running non-kernel code (= user time + nice time)
       sy    Time spent running kernel code (= system time)
       id    Time spent idle
       wa    Time spent waiting for I/O
       st    Time stolen from a virtual machine

The metrics listed under memory can be useful, particularly as the kernel grabs and releases memory. You shouldn't be too worried about these values unless those in the next section (swap) are non-zero. If you see non-zero si and so values, excluding the first row, you should be concerned, because it indicates that the system is swapping, and swapping memory to disk can really kill performance. If a user is complaining about performance and you see a node running really slowly, with a very large load, there's a good chance the node is swapping.
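
For example, here's a minimal sketch of a one-liner that filters the default output down to samples with swap activity (the column positions assume the default format shown in Listing 2, where si and so are the seventh and eighth columns):

$ vmstat 1 | awk 'NR < 3 || $7 + $8 > 0'   # keep the two header lines; print only samples that swap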

The metrics listed in the io section are also good to watch. They list blocks either sent to or received from a block device. If both of these numbers are large, the application running on the node is likely doing something unusual, reading from and writing to the device at the same time, which can hurt performance.

The other metrics can be very useful, as well, but I tend to focus on those mentioned first before scanning the others. You can also send this data to a file for postprocessing or plotting (e.g., for debugging user problems on nodes).
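
For example, you might capture timestamped samples to a file on a suspect node (the filename is arbitrary, and the -t timestamp flag assumes a procps-ng vmstat):

$ vmstat -t 1 600 > vmstat_node042.log   # one sample per second for 10 minutes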

watch

At some point, you will have to debug an application. It might belong to you, or it might belong to another user, but you will be involved. Although debugging can be tedious, you can learn a great deal from it. Lately, one tool I've been using more and more is called watch.

The cool watch [10] tool can help you immensely, just by doing something extremely simple: run a command repeatedly and display the output to stdout. For example, assume a user has an application hanging on a node. One of the first things I want to check is the load on the node (i.e., whether it's very high or very low). Rather than repeatedly typing uptime in a console window as the application executes, I can use watch to do this for me; plus, it will overwrite its previous output so you can observe the system load as it progresses, without looking at infinitely scrolling terminal output.

For a quick example, the simple command

$ watch -n 1 uptime

tells watch to run a command (uptime) every second (-n 1). It will continue to run this command forever unless you interrupt it or kill it. You can change the time interval to whatever you want, keeping in mind that the command being executed could affect system performance. Figure 1 shows a screen capture from my laptop running this command.

Figure 1: Output from the watch -n 1 uptime command.

One useful option to use with watch is -d, which highlights differences between iterations. This option gives you a wonderful way to view the output of time-varying commands like uptime. You can see in Figure 2 that changes are highlighted (I'm not using a color terminal, so they show up as characters with a black background). Notice that the time has changed, as well as the first two load averages.

Figure 2: Output from the watch -n 1 -d uptime command.

One bit of advice when using watch: be careful about passing complicated commands or scripts. By default, watch passes commands to sh -c; therefore, you might have to put the command in quotes to make sure it is passed correctly.
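
For example, a pipeline must be quoted so that the whole thing reaches sh -c; without the quotes, the shell would pipe watch's own output instead. A sketch:

$ watch -n 5 'ps -eo pid,user,%cpu,comm --sort=-%cpu | head -10'   # top 10 CPU consumers, refreshed every 5 sec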

You can use watch in conjunction with all kinds of commands. Personally, I use it with uptime to get a feel for what's happening on a particular node with regard to load. I do this after a node has been rebooted to make sure it's behaving correctly. I also use watch with nvidia-smi on a GPU-equipped node, because it is a great way to tell whether the application is using the GPUs and, if so, to see the load on and the temperature of the GPU(s).
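
On a GPU node, that amounts to something as simple as:

$ watch -n 2 -d nvidia-smi   # refresh every 2 sec and highlight changes in GPU load and temperature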

One thing I have never tried is using watch in conjunction with the pdsh command. I would definitely use a longer time interval than one second, because it can sometimes take a bit of time to gather all the data from the cluster. Moreover, because pdsh doesn't guarantee that it will return output in a certain order, the output could be jumbled from interval to interval; one possible workaround is sketched below. If anyone tries this, be sure to post a note somewhere. Perhaps you know of a pdsh-like tool that guarantees the output order?
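
Because pdsh prefixes every line of output with the node name, piping the output through sort should at least keep the lines grouped by host from one interval to the next (the node list here is hypothetical):

$ watch -n 10 'pdsh -w node[01-08] uptime | sort'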

An absolute killer use of watch on a node is to use it with tmux [11], a terminal multiplexer (i.e., you can take a terminal window and break it into several panes). If you are on a node writing code or watching code execute, you can create another pane and use watch to track the load on the node or GPU usage and temperatures, which is a great way to tell whether the code is using GPUs and when. If you use the command line, tmux and watch should be a part of your everyday kit.
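
As a quick sketch, you could dedicate a pane to watch when you start working on a node (the session name is arbitrary):

$ tmux new-session -d -s nodewatch                        # start a detached session
$ tmux split-window -t nodewatch -v 'watch -n 1 uptime'   # bottom pane tracks the load
$ tmux attach -t nodewatch                                # attach and work in the top pane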
