Log Everything

To be a good HPC system administrator for today’s environment, you need to be a lumberjack.

 

Oh, I’m a lumberjack, and I’m okay,
I sleep all night and I work all day.

From Monty Python’s “The Lumberjack Song.”

Can’t you just imagine yourself in the wilds of British Columbia swinging your ax, breathing fresh air, sleeping under the stars?!!! I can’t either, but Monty Python’s “Lumberjack Song” has a strong message for admins, particularly HPC admins – Log Everything.

Why log everything? Doesn’t that require a great deal of work and storage? The simple answer to these questions is, yes. In fact, you might need to start thinking about a small logging cluster in conjunction with the HPC computational cluster. Such a setup gives you the data to answer questions.

Answering questions is the cornerstone of running HPC systems. These questions include those from users, such as, “Why is my application not running?” or “Why is my application running slowly?” or “Why did I run out of space?” Logging also answers system administrator questions, such as, “What commands did the user run?” or “What nodes was the user allocated during their run?” or “Why is the user storing a bunch of Taylor Swift videos?”

If you haven’t read about the principle of Managing Up, you should. One of the keys to this dynamic is anticipating questions your manager might ask, whether something seemingly as simple as, “How’s the cluster running?” or something with a little more meat to it, such as, “Why isn’t Ms. Johnson’s application running?” or perhaps the targeted question, “How could you screw up so badly?” Implicit in these questions are questions from your manager’s manager, and on up the chain. Managing up means anticipating the questions or situations that might be encountered up the management chain (answering the Bobs’ question about what you actually do). More than likely, management is not being abusive; rather, several people have taken responsibility for spending a great deal of money on your fancy cluster, and they want to know how it’s being utilized and whether it’s worth the investment.

The way to answer these questions is to have data. Data-based answers are always better than guesses or suppositions. What’s the best way to have data? Be a lumberjack and log everything.

Logging

Regardless of what you monitor, you need to be a lumberjack and log it. An HPC system is likely to have a large number of nodes running, even up into the tens of thousands, and the metrics for each node need to be monitored and logged.

The first step in logging is deciding how to write the logs. For example, you could write the logs as a non-root user to a file located somewhere on a common cluster filesystem. A simple way to accomplish this is to create a special user, perhaps lumberjack, and have this user write logs to their /home directory that is mounted across the cluster.

The logs written by this user should have file names specific to each node, which allows you to determine the source of the messages, and each entry should carry a time stamp so that you can reconstruct a time history of events.
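
As a minimal sketch, assuming a shared /home/lumberjack/logs directory and using the load average as a stand-in for whatever metric you care about, a cron-driven script along these lines does the job:

#!/bin/bash
# Sketch: append a time-stamped, node-stamped entry to a per-node logfile
# on a shared filesystem. The path and the metric are assumptions.
LOGFILE=/home/lumberjack/logs/$(hostname -s).log
echo "$(date '+%Y-%m-%d %H:%M:%S') $(hostname -s) load: $(cat /proc/loadavg)" >> "$LOGFILE"

Run it from the lumberjack user’s crontab at whatever interval makes sense, for example, */5 * * * * /home/lumberjack/bin/log-load.sh for a five-minute interval.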

Another good option, instead of writing logs to a user directory, is to use the really cool Linux tool logger, which allows a user to write a message to the system logs. For example, you could easily run the command

$ logger "Just a notification"

to write a message to the standard system log /var/log/syslog on each node. By default, it also writes a time stamp with each log entry. If you don’t want everything landing in /var/log/syslog, you can tag entries with the -t <tag> option or direct them to a specific facility and priority with the -p <facility.priority> option, and then let the syslog configuration route those messages to a separate logfile.
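
For example, a sketch of tagging entries and steering them to their own logfile (the lumberjack tag and the local3 facility are arbitrary choices here, not requirements) looks like:

$ logger -t lumberjack -p local3.info "node health check OK"

On the receiving end, a syslog rule such as local3.* /var/log/lumberjack.log sends everything logged with that facility to its own file.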

If you haven’t noticed yet, logger writes the messages to the local logs, so each node has its own log. However, you really want to collect the logs in a single location to parse them together; therefore, you need a way to gather the logs from all of the nodes to a central location.

A key to good logging habits is to copy or write logs from remote servers (compute nodes) to a central logging server, so you have everything in one place, making it easier to make sense of what is happening on the server(s). You have several ways to accomplish this, ranging from easy to a bit more difficult.

Remote Logging the Easy Way

The simple way to perform remote logging comes with some risk: Configure a cron job on every node that periodically copies the node’s system logs to the centralized log server. The risk is that logs are only copied at the interval specified in the cron job, so if a node fails between copies, the most recent entries for that node never make it to the log server.

A simple script for copying the logs would likely use scp to copy the logs securely from the local node to the log server. You can copy whatever logs or files you like. A key consideration is what you name the files on the log server. Most likely, you will want to put the node name in the name of the logfiles. For example, the name might be node001-syslog, which allows you to store the logs without worrying about overwriting files from other nodes.

Another key consideration is to include in the file name a time stamp of when the log copy occurred, which, again, lets you keep files separate without fear of overwriting and makes creating a time history much easier, as in the sketch below.
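
A minimal sketch of such a copy script, assuming a log server named loghost and passwordless SSH keys for the lumberjack user (both assumptions), looks something like this:

#!/bin/bash
# Sketch: push the local syslog to a central server, encoding the node name
# and the copy time in the destination file name.
NODE=$(hostname -s)
STAMP=$(date +%Y%m%d-%H%M%S)
scp /var/log/syslog lumberjack@loghost:/logs/${NODE}-syslog-${STAMP}

Dropped into a cron job, this gives you per-node, per-copy files on the log server with no risk of one node overwriting another.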

Remote Logging with rsyslog

Another popular option is rsyslog, an open source tool for forwarding log messages to a central server over an IP network. It is configured through the /etc/rsyslog.conf file and the files in the /etc/rsyslog.d/ directory. Because the tool is so configurable and flexible, be sure to read the man pages very carefully.

You can get started fairly easily with rsyslog by using the defaults. On the remote host that collects the logs, you begin by editing the /etc/rsyslog.conf file, uncommenting the following lines:

$ModLoad imtcp
$InputTCPServerRun 514

These lines load the TCP input module and tell rsyslog to listen for incoming log messages on TCP port 514, the default port. After making the change, restart the rsyslog service on the log server.

On every node that is to send its logs to the logging node, you need to make some changes. First, in the file /etc/rsyslog.d/loghost.conf, make sure you have a line such as

*.* @@<loghost>:514

where <loghost> is the name of the logging host (use either an IP address or a resolvable hostname), *.* refers to all logging facilities and priorities, the @@ portion tells rsyslog to use TCP for log transfer (a single @ would tell it to use UDP), and 514 is the TCP port. After this change is made, restart the rsyslog service on every node that sends its logs to the logging server.

In the logfiles on the logging server, the hostname of the node will appear, so you can differentiate logs on the basis of hostnames.
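
If you would rather have the server split incoming messages into one file per node instead of mixing them into its own logs, a sketch in the same legacy configuration style used above (the /var/log/remote path is just an example) is:

$template PerHostLog,"/var/log/remote/%HOSTNAME%.log"
*.* ?PerHostLog

In practice, you would scope this rule to the remote input (e.g., with an rsyslog ruleset) so the server’s own local messages aren’t duplicated into the remote directory.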

You can use either of these approaches, or one that you create, to store all of the system logs in a central location (a logging server). Linux comes with standard logs that can be very useful; alternatively, you might want to think about creating your own logs. In either case, you can log whatever information you feel is needed. The next few sections present some options you might want to consider.

CPU Logs

Many tools for measuring CPU usage draw their data from the /proc filesystem; they range from uptime to top to the sysstat utilities, or you can read entries such as /proc/uptime directly. Which you use is really up to you; however, you should pick one method and stick with it.

To start, you might consider using uptime and sending the result to the system log with logger. Uptime gives you the load averages for the entire node over the past 1, 5, and 15 minutes, so on a fully loaded 16-core node you could see a load average of about 16.0. Keep this in mind as you process the logs.

If you have a heterogeneous cluster with different types of nodes, the core counts will differ from node to node, and you have to take this into account when you process the logs.

A simple example of using uptime for CPU statistics is to create a cron job that runs the utility at some interval and writes the output with logger, as in the sketch below. You could write to the default system log (syslog), to a different system log, or to a log that you create. The important thing is to pick an approach and stay with it (i.e., don’t mix solutions).
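
A sketch of such an entry in /etc/cron.d style (the five-minute interval and the cpu-load tag are arbitrary choices) is:

*/5 * * * * root logger -t cpu-load "$(uptime)"

Every five minutes this drops the normal uptime output, tagged cpu-load, into the system log, where rsyslog can forward it to the central server.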

If you want or need more granular data than uptime provides for CPU monitoring, I would suggest using mpstat, which is part of the sysstat package included in many distributions.

The mpstat command writes CPU stats to standard output (stdout) for each available processor in the node, starting with CPU 0. It reports a boatload of statistics, including:

  • CPU: Processor number for the output
  • %usr: Percentage of CPU utilization by user applications
  • %nice: Percentage of CPU utilization by user applications using the nice priority
  • %sys: Percentage of CPU utilization at the system (kernel) level, not including the time for servicing hardware or software interrupts
  • %iowait: Percentage of CPU utilization spent idle during which the system had an outstanding I/O request
  • %irq: Percentage of CPU utilization spent servicing hardware interrupts
  • %soft: Percentage of CPU utilization servicing software interrupts
  • %steal: Percentage of CPU utilization spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor
  • %guest: Percentage of CPU utilization spent by the CPU or CPUs to run a virtual processor
  • %gnice: Percentage of CPU utilization spent by the CPU or CPUs to run a guest with a nice priority
  • %idle: Percentage of CPU utilization spent idle while the system did not have an outstanding I/O request

You can run mpstat with a specified interval and leave it running for as long as the node is powered on, either as a regular user or as root. However, if you want to write the output to a system logfile, you might instead create a cron job that runs the command once at a specified interval and writes the appropriate part of the output to the system logs with logger, as in the sketch below.
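
A sketch of such a cron-driven command, capturing one per-processor sample and handing the output to the system log (the mpstat tag is arbitrary), is:

$ mpstat -P ALL 1 1 | logger -t mpstat

Each line of mpstat output becomes a separate, time-stamped entry in the log.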

Depending on what you want, you can get statistics for all of the processors combined or for every processor in the node. If you have enough logging space, I recommend getting statistics for each processor, which allows you to track a number of things, including hung processes, processes “jumping” around processors, or processor hogs (a very technical term, by the way).

One last comment about CPU statistics: Be very careful in choosing an interval for gathering those statistics. For example, do you really want to gather CPU stats every second for every compute node? Unless you do some special configuration, you will be gathering statistics for nodes that aren’t running jobs. Although it could be interesting, at the same time, it could just create a massive amount of data that indicates the node wasn’t doing anything.

If you gather statistics every second for each core on a node with 40 total cores, in one minute you have gathered 2,400 lines of stats (one per core per second). If you have 100 such nodes, in one minute you have gathered 240,000 lines of stats for the cluster. In one day, this is 345,600,000 lines of stats for the 100 nodes.

You could increase the interval in gathering the statistics to reach a target for the amount of stats gathered, but another option is a little more clever: On each node you could have a cron job that gathers the CPU stats every minute or few minutes (call this the “long-term” CPU stats metric) that are then written to a specific log. Then, in the prologue and epilogue scripts for the job scheduler, you could create or start a cron job that gathers CPU stats more frequently (call this the “short-term” CPU stats metric). When a job is started on the node by the job scheduler, the CPU stats are then written to a different log than the long-term CPU stats, which allows you to grab more refined statistics for jobs. Moreover, you can correlate the CPU stats with the specific job.
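
As a sketch of the idea, assuming the scheduler exposes a job ID in $JOBID and that a /var/log/job-cpu directory exists (both assumptions), the prologue and epilogue fragments might look like this:

# Prologue fragment: start a short-term, per-CPU collector tied to the job ID.
nohup mpstat -P ALL 10 > /var/log/job-cpu/${JOBID}.mpstat 2>&1 &
echo $! > /var/run/job-cpu-${JOBID}.pid

# Epilogue fragment: stop the collector when the job finishes.
kill "$(cat /var/run/job-cpu-${JOBID}.pid)"
rm -f /var/run/job-cpu-${JOBID}.pid

Because the output file carries the job ID, correlating the short-term CPU stats with a specific job later is trivial.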

Memory Logs

CPU stats are important for a number of obvious reasons, but memory errors are perhaps equally important. However, whereas CPUs are being monitored for performance, memory is being monitored for errors.

The topic of memory monitoring can get a little involved. For more information, you can find one of my earlier articles online. Assuming you have ECC (error-correcting code) memory, you have a few choices about what to monitor:

  • Number of uncorrectable errors (ue_count)
  • Number of correctable errors with no information about the affected DIMM (ce_noinfo_count)
  • Memory scrubbing rate (sdram_scrub_rate)
  • Seconds since the error counters were reset (seconds_since_reset)
  • Megabytes of memory (size_mb)
  • Total correctable errors (ce_count)
  • Number of uncorrectable errors with no information about the affected DIMM (ue_noinfo_count)
  • Type of memory (mem_type)
  • Type of DRAM device (dev_type)

The linked article has a simple bash script that collects information from the /sys filesystem and appends it to a simple CSV file, as well as some simple Python code for reading and processing the CSV file.

Taking the simple collecting script and putting it in a cron job on all nodes is a great way to record memory logs. You can use logger to write it to the system log, or you can create a specific log and then rely on rsyslog to copy it to the central logging server.
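
A minimal sketch of such a collector, reading the standard EDAC counters from /sys and handing them to logger (it assumes the EDAC driver is loaded), is:

#!/bin/bash
# Sketch: log ECC error counters for every memory controller on the node.
for MC in /sys/devices/system/edac/mc/mc*; do
    CE=$(cat "$MC/ce_count")
    UE=$(cat "$MC/ue_count")
    logger -t edac "$(basename "$MC") ce_count=$CE ue_count=$UE"
done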

Network

Grabbing network logs is a double-edged sword. Having the network information to correlate with CPU, memory, and other logs is wonderful; however, the resulting network logs can be HUGE, to say the least. In the case of TCP traffic, you would need to log the packets coming and going on all of the open ports for every node, which could be a great deal of traffic, depending on the number of open ports and what the applications are doing.

Rather than log every single packet on every network, I think it’s better to do two things:

  1. Perform periodic micro-performance tests from one node to another. This is a very simple and quick test between two or more nodes using the main application and storage networks. Ideally, it would be nice to record the 0-byte packet latency and the maximum bandwidth before the user application starts, but it could result in a large delay in starting the user’s application. These tests can be run easily during a prologue job script and even during an epilogue script.
  2. Log network errors. Although this sounds simple, it can be a bit complicated. The errors on the switches and the hosts should be logged in a single place. For example, on the hosts, you can gather packet “errors” with something as simple as ifconfig (which is deprecated, so you might want to use the ip command instead). A simple script can do this and can be put in a cron job (a sketch follows below). You can do the same for InfiniBand (IB) connections on the host.
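
A sketch of the host-side error collection mentioned in item 2, reading the counters straight from /sys/class/net rather than parsing command output, is:

#!/bin/bash
# Sketch: log receive/transmit error and drop counters for each non-loopback interface.
for IF in $(ls /sys/class/net | grep -v '^lo$'); do
    S=/sys/class/net/$IF/statistics
    logger -t neterr "$IF rx_errors=$(cat $S/rx_errors) tx_errors=$(cat $S/tx_errors) rx_dropped=$(cat $S/rx_dropped)"
done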

Grabbing switch logs to look for errors is very dependent on the switch manufacturer, so be sure to read the switch manuals and write some simple scripts to grab switch logs periodically. The same is true for IB switches.

Storage

Recording storage logs is very similar to recording network logs. Grabbing each data packet going to and from the storage system and the drives results in a large amount of information, most of which is useless to you. Instead, think about running simple I/O tests in a job’s prologue and epilogue scripts and recording that data. Of course, the results will vary depending on the I/O load, but it’s worth understanding I/O performance when the job is getting ready to run.
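
A sketch of a prologue I/O probe along those lines, writing 1GiB with direct I/O to the job’s scratch area and logging dd’s throughput summary (the $SCRATCH variable and the 1GiB size are assumptions), is:

#!/bin/bash
# Sketch: quick write test before the job starts; log the throughput line.
TESTFILE=${SCRATCH:-/tmp}/ioprobe.$$
RESULT=$(dd if=/dev/zero of=$TESTFILE bs=1M count=1024 oflag=direct 2>&1 | tail -n 1)
logger -t ioprobe "prologue write test: $RESULT"
rm -f "$TESTFILE"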

In addition to capturing performance information, you can grab I/O performance statistics from the servers and clients. A simple example is NFS. A great tool, nfsiostat, allows you to capture statistics about NFS client and server activity. With respect to clients, you can grab information such as:

  • Number of blocks read or written
  • Number of reads and writes (ops/sec)

Or, you could gather the same information, but with O_DIRECT. With this information, you can get a histogram of the NFS performance of both clients and servers.

In addition to nfsiostat, you can use iostat, which collects lots of metrics on the storage server, such as CPU time, throughput, and I/O request times. You can also use iostat to monitor I/O on client nodes.
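
A sketch of capturing those device statistics to the logs at some interval (the iostat tag is arbitrary; the first report covers averages since boot, the second is a one-second sample) is:

$ iostat -dxk 1 2 | logger -t iostat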

You are likely already using filesystem tools, so you can easily look for errors in the filesystem logs and collect them (script this). These logs are specific to each filesystem, so be sure to read the manuals on what is being recorded.

Summary

A number of system administrators are reluctant to log much more than the minimum necessary for compliance. However, I’m a big believer that having too much information is better than not having enough. More logs mean more space used and probably more network traffic, but in the end, you have a set of system logs that you can use to your advantage.

To review, here are four highlights:

  • Log everything (within reason).
  • Put a time stamp on it.
  • Put a node name on every entry.
  • Be a lumberjack, and you’ll be OK.
