System logging for data-based answers

Log Everything

CPU Logs

Many tools for measuring CPU usage get their data from the /proc filesystem. They range from uptime to top to the sysstat tools, or you can read files such as /proc/uptime directly. Which you use is really up to you; however, you should pick one method and stick with it.

To start, you might consider running uptime and sending the result to the system log with logger. uptime gives you the load averages for the entire node for the past 1, 5, and 15 minutes. Keep this in mind as you process the logs.

If you have a heterogeneous cluster with different types of nodes, the number of cores per node can differ, and load averages are not normalized by core count. You have to take this into account when you process the logs.

A simple example of using uptime for CPU statistics is to create a cron job that runs the utility at some time interval and writes the output with logger. You could write this to the system log (syslog), which is the default, or you could write it to a different system log or a log that you create. The important thing is to pick an approach and stay with it (i.e., don't mix solutions).
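
A minimal sketch of such a cron job follows; the script name, the "cpu-uptime" tag, and the five-minute interval are my assumptions, not requirements:

#!/bin/bash
# log_uptime.sh - a minimal sketch: send the node's load averages to
# syslog with a recognizable tag so the entries are easy to filter later.
#
# Example crontab entry (the path and five-minute interval are placeholders):
#   */5 * * * * /usr/local/sbin/log_uptime.sh
logger -t cpu-uptime "$(uptime)"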

If you want or need more granular data than uptime provides for CPU monitoring, I would suggest using mpstat [5], which is part of the sysstat [6] package included in many distributions.

The mpstat command writes CPU stats to standard output (stdout) for each available processor in the node, starting with CPU 0. It reports a boatload of statistics, including:

  • CPU: Processor number for the output
  • %usr: Percentage of CPU utilization by user applications
  • %nice: Percentage of CPU utilization by user applications using the nice priority
  • %sys: Percentage of CPU utilization at the system (kernel) level, not including the time for servicing hardware or software interrupts
  • %iowait: Percentage of time the CPUs were idle while the system had an outstanding I/O request
  • %irq: Percentage of CPU utilization spent servicing hardware interrupts
  • %soft: Percentage of CPU utilization servicing software interrupts
  • %steal: Percentage of CPU utilization spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor
  • %guest: Percentage of CPU utilization spent by the CPU or CPUs to run a virtual processor
  • %gnice: Percentage of CPU utilization spent by the CPU or CPUs to run a guest with a nice priority
  • %idle: Percentage of time the CPUs were idle while the system did not have an outstanding I/O request

You can run mpstat at a specified interval for as long as the node is powered on, either as a regular user or as root. However, if you want to write the output to a system logfile, you might have to create a cron job that runs the command once at a specified interval and writes the appropriate part of the output to the system logs with logger.
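
A minimal sketch of such a cron-driven collector follows; the five-second sample, the per-processor option, and the "cpu-mpstat" tag are assumptions you would adjust for your site:

#!/bin/bash
# log_mpstat.sh - a minimal sketch: take one five-second mpstat sample
# for every CPU, keep only the per-CPU "Average" lines (dropping the
# header line), and write each line to syslog.
mpstat -P ALL 5 1 | awk '/^Average/ && $2 != "CPU"' | logger -t cpu-mpstat

Running this from cron once a minute gives you per-processor statistics without leaving mpstat running continuously.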

Depending on what you want, you can get statistics for all of the processors combined or for each processor in the node. If you have enough logging space, I recommend getting statistics for each processor, which allows you to track a number of things, including hung processes, processes "jumping" around processors, and processor hogs (a very technical term, by the way).

One last comment about CPU statistics: Be very careful in choosing an interval for gathering those statistics. For example, do you really want to gather CPU stats every second for every compute node? Unless you do some special configuration, you will be gathering statistics for nodes that aren't running jobs. Although that could be interesting, it could also just create a massive amount of data that indicates the node wasn't doing anything.

If you gather statistics on each core on a node with 40 total cores every second, in one minute you have gathered 2,400 lines of stats (40 cores x 60 seconds). If you have 100 nodes, in one minute you have gathered 240,000 lines of stats for the cluster. In one day, that is 345,600,000 lines of stats for the 100 nodes.

You could increase the interval for gathering the statistics to reach a target for the amount of stats gathered, but another option is a little more clever: On each node, you could have a cron job that gathers the CPU stats every minute or every few minutes (call this the "long-term" CPU stats metric) and writes them to a specific log. Then, in the prologue and epilogue scripts for the job scheduler, you could create or start a cron job that gathers CPU stats more frequently (call this the "short-term" CPU stats metric). When the job scheduler starts a job on the node, these CPU stats are written to a different log than the long-term CPU stats, which allows you to grab more refined statistics for jobs. Moreover, you can correlate the CPU stats with the specific job.
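
One possible sketch of that scheme follows, assuming a Slurm-style prolog and epilog; the SLURM_JOB_ID variable, the 10-second interval, the PID file location, and the log tag are all assumptions you would adapt to your scheduler and site:

#!/bin/bash
# prolog sketch: start a short-term CPU stats collector when the
# scheduler places a job on the node, tagging entries with the job ID.
JOBID="${SLURM_JOB_ID:-unknown}"
(
  while true; do
    mpstat -P ALL 1 1 | awk '/^Average/ && $2 != "CPU"' | \
      logger -t "cpu-job-$JOBID"
    sleep 10
  done
) &
echo $! > "/var/run/cpu-job-$JOBID.pid"

A matching epilog script then stops the collector:

#!/bin/bash
# epilog sketch: stop the short-term collector when the job ends.
JOBID="${SLURM_JOB_ID:-unknown}"
kill "$(cat "/var/run/cpu-job-$JOBID.pid")" 2>/dev/null
rm -f "/var/run/cpu-job-$JOBID.pid"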

Memory Logs

CPU stats are important for a number of obvious reasons, but memory errors are perhaps equally important. However, whereas CPUs are being monitored for performance, memory is being monitored for errors.

The topic of memory monitoring can get a little involved. For more information, you can find one of my earlier articles online [7]. Assuming you have error-correcting code (ECC) memory, you have a few choices about what to monitor:

  • Total uncorrectable errors (ue_count)
  • Number of correctable errors with no information about which DIMM they came from (ce_noinfo_count)
  • Memory scrubbing rate (sdram_scrub_rate)
  • Seconds since the error counters were last reset (seconds_since_reset)
  • Megabytes of memory (size_mb)
  • Total correctable errors (ce_count)
  • Number of uncorrectable errors with no information about which DIMM they came from (ue_noinfo_count)
  • Type of memory (mem_type)
  • Type of DRAM device (dev_type)

The linked article has a simple bash script that collects this information from the /sys filesystem and appends it to a CSV file, as well as some Python code for reading and processing the CSV file.

Taking that collection script and putting it in a cron job on all nodes is a great way to record memory logs. You can use logger to write the output to the system log, or you can create a specific log and then rely on rsyslog to copy it to the central logging server.
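
A minimal sketch of that collection idea follows, drawing the counters straight from the EDAC entries in /sys; the "ecc-mem" tag and the choice of only two counters are my simplifications (the linked article's script records more fields):

#!/bin/bash
# log_ecc.sh - a minimal sketch: for every memory controller the kernel
# exposes under /sys, log the correctable and uncorrectable ECC counts.
for mc in /sys/devices/system/edac/mc/mc*; do
    [ -d "$mc" ] || continue
    ce=$(cat "$mc/ce_count")
    ue=$(cat "$mc/ue_count")
    logger -t ecc-mem "$(basename "$mc") ce_count=$ce ue_count=$ue"
done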

Network Logs

Grabbing network logs is a double-edged sword. Having the network information to correlate with CPU, memory, and other logs is wonderful; however, the resulting network logs can be HUGE, to say the least. In the case of TCP traffic, you would need to log the packets coming and going on all of the open ports for every node, which could be a great deal of traffic, depending on the number of open ports and what the applications are doing.

In my opinion, rather than log every single packet on every network, it's better to do two things:

1. Perform periodic micro-performance tests from one node to another. This is a very simple and quick test between two or more nodes using the main application network and storage networks. Ideally, it would be nice to record the zero-byte packet latency and the maximum bandwidth before the user application starts, but doing so could delay the start of the user's application. These tests can be run easily in a prologue job script and even in an epilogue script (a latency sketch follows this list).

2. Log network errors. Although this sounds simple, it can be a bit complicated. The errors on the switches and the hosts should be logged in a single place. For example, on the hosts, you can gather packet "errors" with something as simple as ifconfig (which is deprecated, so you might want to use the ip command instead). A simple script can do this and can be put in a cron job (a host-side sketch follows the next paragraph). You can do the same for InfiniBand (IB) connections on the host.
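
For the first item, a minimal prolog-style latency sketch follows; it uses ping as a rough stand-in for a zero-byte latency test, and the peer hostname, packet count, and "net-latency" tag are assumptions:

#!/bin/bash
# net_latency.sh - a minimal sketch: measure small-packet round-trip
# latency to a neighboring node and record the summary line in syslog.
PEER="${1:-node002}"                        # hypothetical neighbor node
SUMMARY=$(ping -c 10 -q "$PEER" | tail -1)  # "rtt min/avg/max/mdev = ..."
logger -t net-latency "peer=$PEER $SUMMARY"

A bandwidth check could be added with a tool such as iperf3, at the cost of a longer delay before the user's job starts.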

Grabbing switch logs to look for errors is very dependent on the switch manufacturer, so be sure to read the switch manuals and write some simple scripts to grab switch logs periodically. The same is true for IB switches.
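
For the second item, here is a minimal host-side sketch that pulls error and drop counters with the ip command and is suitable for a periodic cron job; the interface name and the "net-errors" tag are assumptions, and the switch side would still be collected however the switch vendor allows:

#!/bin/bash
# net_errors.sh - a minimal sketch: log receive/transmit error and drop
# counters for one network interface.
IFACE="${1:-eth0}"                 # hypothetical interface name
STATS=$(ip -s link show "$IFACE")
RX=$(echo "$STATS" | awk '/RX:/ {getline; print "rx_errors=" $3, "rx_dropped=" $4}')
TX=$(echo "$STATS" | awk '/TX:/ {getline; print "tx_errors=" $3, "tx_dropped=" $4}')
logger -t net-errors "$IFACE $RX $TX"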
