To be a good HPC system administrator for today’s environment, you need to be a lumberjack.

Log Everything

Oh, I’m a lumberjack, and I’m okay,
I sleep all night and I work all day.

From Monty Python’s “The Lumberjack Song.”

Can’t you just imagine yourself in the wilds of British Columbia swinging your ax, breathing fresh air, sleeping under the stars?!!! I can’t either, but Monty Python’s “Lumberjack” song has a strong message for admins, particularly HPC admins – Log Everything.

Why log everything? Doesn’t that require a great deal of work and storage? The simple answer to the second question is yes. In fact, you might need to start thinking about a small logging cluster in conjunction with the HPC computational cluster. The answer to the first question is that such a setup gives you the data you need to answer questions about the system.

Answering questions is the cornerstone of running HPC systems. These questions include those from users, such as “Why is my application not running?” or “Why is my application running slow?” or “Why did I run out of space?” They also include system administrator questions, such as “What commands did the user run?” or “What nodes was the user allocated during their run?” or “Why is the user storing a bunch of Taylor Swift videos?”

If you haven’t read about the principle of Managing Up, you should. One of the keys to this dynamic is anticipating questions your manager might ask, from something seemingly as simple as, “How’s the cluster running?” to something with a little more meat to it, such as “Why isn’t Ms. Johnson’s application running?” or perhaps the targeted question, “How could you screw up so badly?” Implicit in these questions are questions from your manager’s manager, and on up the chain. Managing up means anticipating the questions or situations that might be encountered up the management chain (answering the Bobs’ question about what you actually do). More than likely, management is not being abusive; rather, several people have taken responsibility for spending a great deal of money on your fancy cluster, and they want to know how it’s being used and whether it’s worth the investment.

The way to answer these questions is to have data. Data-based answers are always better than guesses or suppositions. What’s the best way to have data? Be a lumberjack and log everything.

Logging

Regardless of what you monitor, you need to be a lumberjack and log it. HPC systems can have many nodes running, even into the tens of thousands, and the metrics for each node need to be monitored and logged.

The first step in logging is deciding how to write the logs. For example, you could write the logs as a non-root user to a file located somewhere on a common cluster filesystem. A simple way to accomplish this is to create a special user, perhaps lumberjack, and have this user write logs to their /home directory that is mounted across the cluster.

The logs written by this user should have file names specific to each node, which allows you to determine the source of the messages. You should also put a time stamp on each log entry so that you can get a time history of events.
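A minimal sketch of this approach follows, assuming the lumberjack user keeps its logs in a logs directory under its home directory. The LOG_DIR variable and the log_entry function are hypothetical names chosen for illustration, and the entry format is just one reasonable choice.

```shell
#!/bin/sh
# Sketch: append a time-stamped entry to a per-node log file.
# LOG_DIR is a hypothetical location on the shared filesystem.
LOG_DIR="${LOG_DIR:-$HOME/logs}"

log_entry() {
    # Short node name (domain stripped), e.g. node001
    node=$(uname -n)
    node=${node%%.*}
    # UTC time stamp in a format that sorts correctly as plain text
    stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    mkdir -p "$LOG_DIR" || return 1
    # One file per node, one line per entry
    printf '%s %s %s\n' "$stamp" "$node" "$*" >> "$LOG_DIR/$node.log"
}
```

Calling log_entry "scratch filesystem at 91 percent" on node001 would append a line to node001.log in that directory. Because each node appends only to its own file, nodes never write to the same file at the same time.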

Another good option relative to writing logs to a user directory is to use the really cool Linux tool logger, which allows a user to write a message to the system logs. For example, you could easily run the command

$ logger "Just a notification"

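to write a message to the standard system log on each node (/var/log/syslog on Debian-based distributions, /var/log/messages on Red Hat-based ones). By default, the time stamp is written with the log entry. Note that the -f <file> option does not select the destination log; it tells logger to read the messages to be logged from <file>. To direct messages somewhere other than the default log, set a facility and level with the -p option (e.g., -p local0.info) and configure the syslog daemon to route that facility to the file you want.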
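As a small illustration, the wrapper below adds a tag and a priority to each message with the -t and -p options of logger. The function name log_event and the tag lumberjack are assumptions made for this sketch; where a given facility ends up depends on the syslog configuration of the node.

```shell
#!/bin/sh
# Sketch: wrap logger so every message carries a common tag and an
# explicit facility.level priority. The tag "lumberjack" is illustrative.
log_event() {
    # $1 = priority such as local0.info; remaining arguments = message
    prio=$1
    shift
    # -t sets the tag, -p sets the priority; "--" ends option parsing so
    # a message that starts with a dash is not read as an option
    logger -t lumberjack -p "$prio" -- "$*"
}

# Example (not run here):
#   log_event local0.info "Job 1234 started on $(uname -n)"
```

A consistent tag makes it easy to pull these entries back out of the system log later with a tool such as grep.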

If you haven’t noticed yet, logger writes the messages to the local logs, so each node has its own log. However, you really want to collect the logs in a single location to parse them together; therefore, you need a way to gather the logs from all of the nodes to a central location.

A key to good logging habits is to copy or write logs from remote servers (compute nodes) to a central logging server, so you have everything in one place, making it easier to make sense of what is happening on the server(s). You have several ways to accomplish this, ranging from easy to a bit more difficult.

Remote Logging the Easy Way

The simple way to perform remote logging comes with some risk: Configure a cron job on every node that periodically copies the node system logs to the centralized log server. The risk is that logs are copied only at the interval specified in the cron job, so if a node fails between copies, the log server won’t have any of the messages that node wrote since the last copy.

A simple script for copying the logs would likely use scp to copy the logs securely from the local node to the log server. You can copy whatever logs or files you like. A key consideration is what you name the files on the log server. Most likely, you will want to put the node name in the name of the log files. For example, the name might be node001-syslog, which allows you to store the logs without worrying about overwriting files from other nodes.

Another key consideration is to include the time stamp when the log copy occurs, which, again, lets you keep files separate without fear of overwriting and makes the creation of a time history much easier.
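The two naming considerations can be combined in a short script along these lines. This is only a sketch: the log server name (logserver), the destination directory (/var/log/cluster), the function names, and the script path in the sample crontab entry are all hypothetical, and it assumes key-based SSH access from each node to the log server.

```shell
#!/bin/sh
# Sketch: copy a local log to the central log server under a name that
# includes the node name and the time of the copy.
# LOG_SERVER and DEST_DIR are hypothetical; adjust for your site.
LOG_SERVER="${LOG_SERVER:-logserver}"
DEST_DIR="${DEST_DIR:-/var/log/cluster}"

# Build a name such as node001-syslog-20240101T120000Z
remote_log_name() {
    node=$(uname -n)
    node=${node%%.*}
    printf '%s-%s-%s\n' "$node" "$(basename "$1")" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Copy one log file to the log server with scp
copy_log() {
    scp -q "$1" "$LOG_SERVER:$DEST_DIR/$(remote_log_name "$1")"
}

# In the real script you would finish with a call such as:
#   copy_log /var/log/syslog
# and schedule it with a crontab entry, here every 15 minutes:
#   */15 * * * * /usr/local/bin/copy_log.sh
```

A shorter interval narrows the window in which messages can be lost, at the cost of more copies and more files on the log server.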