Command-line tools for the HPC administrator

Line Items

whereis and which

The $PATH variable in Linux and *nix lists the directories that the shell searches, in order, when looking for a command. If you run the command voodoo and the result is an error message like can't find voodoo, but you know it is installed on your system, you might have a $PATH problem.
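Because $PATH is a single colon-separated string, it is easier to check when printed one directory per line. The following is a minimal sketch; print_path_dirs is a hypothetical helper name, not a standard command.

```shell
# Print a colon-separated directory list one entry per line.
# With no argument, the current $PATH is used.
# (print_path_dirs is a hypothetical helper name, not a standard command.)
print_path_dirs() (
    # The body runs in a subshell, so the IFS and noglob changes do not leak.
    IFS=':'
    set -f                        # do not expand wildcards in directory names
    for dir in ${1:-$PATH}; do    # unquoted on purpose: split on ':'
        printf '%s\n' "$dir"
    done
)
```

Running print_path_dirs with no argument lists the directories in the order the shell searches them, so a missing or misspelled entry is easy to spot.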

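You could look at your $PATH variable with the env command, but I like to use the simple whereis command, which tells you where a command's binary, source, and man pages are located. Keep in mind that whereis searches a set of standard system directories in addition to $PATH, so a hit by itself does not prove the command is in your $PATH. For example, when I looked for perl (Figure 3), the output told me where the man pages were located, as well as the binary.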

Figure 3: Output from the whereis command.

Think about a situation in which your $PATH is munged, and all of a sudden, you can't run simple commands. An easy way to discover the problem is to use whereis. If the command is not in your $PATH, you can now use find to locate it – if it's on the system.
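The fallback to find can be sketched as a small function. This is only an illustration: find_executable is a hypothetical helper name, and the search root defaults to /, which can be slow on large or network filesystems.

```shell
# Search a directory tree for an executable regular file with a given name.
# $1 = command name, $2 = directory to search (defaults to /).
# (find_executable is a hypothetical helper, not a standard command.)
find_executable() {
    # -perm -u+x keeps only files whose owner-execute bit is set;
    # permission-denied messages are discarded.
    find "${2:-/}" -type f -name "$1" -perm -u+x 2>/dev/null
}
```

For example, find_executable voodoo /opt limits the search to /opt. Once you know the directory, you can decide whether it belongs in $PATH.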

Another useful command, which, tells you what version of a command will run when you execute it. For example, assume you have more than one GCC compiler on your system. How do you know which one will be used? The simple way is to use which, as shown in Figure 4.

Figure 4: Output from the which command.
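The rule behind the which output is that the first match in $PATH wins. The sketch below shows that rule with shell built-ins only; first_match and all_matches are hypothetical helper names. On many systems, which -a gcc or the Bash built-in type -a gcc gives you a similar list of every match.

```shell
# Show the copy of a command the shell would run (first match in $PATH).
# command -v is the POSIX built-in counterpart to which.
first_match() {
    command -v "$1"
}

# List every executable copy found in $PATH, in search order.
# (first_match and all_matches are hypothetical helper names.)
all_matches() (
    IFS=':'
    set -f                                  # no wildcard expansion while splitting
    for dir in $PATH; do
        candidate="${dir:-.}/$1"            # an empty entry means the current directory
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
)
```

If all_matches gcc prints more than one line, the first line is the compiler you get by default; the others are only reachable by full path or by reordering $PATH.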

I use which quite a bit when I create new modules for Lmod. On more than one occasion, I have damaged my $PATH so that the command for which I'm trying to write a module isn't in the $PATH variable. If which can't find the command after I load the module, I know I managed to munge something in the module.

I promise you that if you are a system administrator for any kind of *nix system, HPC or otherwise, at some point, whereis and which are going to help you solve a problem. My favorite war story is about a user who managed to erase their $PATH completely on a cluster and could do nothing. The problem was in the user's .bashrc file, where they had basically erased their $PATH in an attempt to add a new path.
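The story does not include the user's actual .bashrc, so the broken line in the following sketch is only a guess at the kind of mistake involved. The directory /opt/voodoo/bin and the helper name path_prepend are made up for illustration.

```shell
# Risky: this replaces $PATH instead of extending it, so ls, cp, and
# friends are no longer found. (/opt/voodoo/bin is a made-up path.)
#   PATH=/opt/voodoo/bin

# Safer: prepend the directory and keep what was already there.
# (path_prepend is a hypothetical helper, not a standard command.)
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                      # already listed: change nothing
        *) PATH="$1${PATH:+:$PATH}" ;;    # add in front, keep the rest
    esac
    export PATH
}
```

In a .bashrc file, path_prepend /opt/voodoo/bin can then be called as often as needed without duplicating entries or dropping the system directories.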

lsblk

When I get on a new system, one of the first things I want to know is how the storage is laid out. Also, in the wake of a filesystem issue (e.g., it's not mounted), I want a tool to discover the problem. The simple lsblk command can help in both cases.

As you examine the command, it seems fairly obvious that ls plus blk will "list all block devices" on the system (Figure 5). This is not the same as listing all mounted filesystems, which is accomplished with the mount command (which lists all network filesystems, as well).

Figure 5: Output from the lsblk command.

The default "tree" output shows the partitions of a particular block device. The block device sizes, in human-readable format, are also shown, as are their mount points (if applicable). A useful option is -f, which adds filesystem information, such as the filesystem type, label, and UUID, to the lsblk output (Figure 6).

Figure 6: Output from the lsblk -f command.
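For the "it's not mounted" case, lsblk output can also be filtered in a script. The sketch below assumes the key="value" format produced by lsblk -P with the NAME, FSTYPE, and MOUNTPOINT columns; list_unmounted is a hypothetical helper name. Devices that are never mounted directly (e.g., LVM or RAID members) will also show up, so treat the result as a list of candidates to check.

```shell
# Read `lsblk -P -o NAME,FSTYPE,MOUNTPOINT` lines on stdin and print the
# names of devices that carry a filesystem signature but have no mount point.
# (list_unmounted is a hypothetical helper, not part of util-linux.)
list_unmounted() {
    grep 'MOUNTPOINT=""' |                    # keep devices without a mount point
        grep -v 'FSTYPE=""' |                 # drop devices without a filesystem
        sed 's/^NAME="\([^"]*\)".*/\1/'       # print only the device name
}
```

A typical call would be lsblk -P -o NAME,FSTYPE,MOUNTPOINT | list_unmounted.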

kill

Sometime in your administrative career, you will have to use the kill [12] command, which sends a signal to a process (SIGTERM by default) to tell it to terminate. In fact, you can send a host of signals to processes (Table 2). These signals can accomplish a number of objectives, but the most useful is SIGKILL.

Table 2: Process Signals

SIGHUP     SIGUSR2      SIGURG
SIGINT     SIGPIPE      SIGXCPU
SIGQUIT    SIGALRM      SIGXFSZ
SIGILL     SIGTERM      SIGVTALRM
SIGTRAP    SIGSTKFLT    SIGPROF
SIGABRT    SIGCHLD      SIGWINCH
SIGIOT     SIGCONT      SIGIO and SIGPOLL
SIGFPE     SIGSTOP      SIGPWR
SIGKILL    SIGTSTP      SIGSYS
SIGUSR1    SIGTTIN
SIGSEGV    SIGTTOU

I call SIGKILL the "extreme prejudice" option. If you have a process that just will not die, it's time to use SIGKILL:

$ kill -9 [PID]

Theoretically, this should end the process specified, but if for some crazy reason the process won't die (terminate), and you need it to die, the only other action I know to take is to shut down the system. Many times this can result in a compromised configuration when the system is restarted, but you might not have much choice.
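A gentler variation, which is not the approach described above but a common companion to it, is to send SIGTERM first, give the process a few seconds to clean up, and fall back to SIGKILL only if it is still running. The following is a minimal sketch under those assumptions; terminate is a hypothetical helper name and the five-second default is arbitrary.

```shell
# Ask a process to exit with SIGTERM, wait up to $2 seconds (default 5),
# and send SIGKILL only if it is still there.
# (terminate is a hypothetical helper name; the grace period is arbitrary.)
terminate() {
    local pid="$1" grace="${2:-5}"
    # If SIGTERM cannot be delivered, the process is gone or not ours to signal.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # signal 0 only tests for existence
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null                # same effect as kill -9
    return 0
}
```

Calling terminate [PID] 10 waits up to 10 seconds before resorting to the "extreme prejudice" option.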

As with whereis and which, I can promise that you will have to use kill -9 to stop a process. Sometimes, the problem is the result of a wayward user process, and one way to find that process is to use the commands mentioned in this article. For example, you can use the watch command to monitor the load on the system. If the system is supposed to be idle but watch -n 1 uptime shows a reasonably high load, then you might have a hung process taking up resources. Also, you can use watch in a script to find user processes that are still running on a node that isn't accessible to users (i.e., it has been taken out of production). In either case, you can then use kill -9 to end the process(es).
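Finding leftover user processes on a node can be sketched as a filter over ps output. The cutoff of UID 1000 for regular users is an assumption (many distributions set UID_MIN to 1000 in /etc/login.defs, but yours may differ), and user_procs is a hypothetical helper name.

```shell
# Keep only processes owned by regular users.
# stdin: one process per line as "UID PID USER COMMAND"
# $1:    lowest UID counted as a regular user (1000 is an assumption;
#        check UID_MIN in /etc/login.defs on your distribution)
# (user_procs is a hypothetical helper name.)
user_procs() {
    awk -v min="${1:-1000}" '$1 + 0 >= min + 0 { print $2, $3, $4 }'
}
```

One way to feed it is ps -eo uid=,pid=,user=,comm= | user_procs, which prints the PID, owner, and command name of each match. Accounts with a high UID that are not real users, such as nobody, will also appear, so review the list before you reach for kill -9.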
