Log Everything

Memory Logs

CPU stats are important for a number of obvious reasons, but memory errors are perhaps equally important. The difference is that CPUs are monitored for performance, whereas memory is monitored for errors.

The topic of memory monitoring can get a little involved. For more information, you can find one of my earlier articles online. Assuming you have ECC (error-correcting code) memory, you have a few choices about what to monitor:

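  • Number of correctable errors (ce_count)
  • Number of correctable errors with no information about which DIMM slot is affected (ce_noinfo_count)
  • Memory scrubbing rate (sdram_scrub_rate)
  • Seconds since the error counters were last reset (seconds_since_reset)
  • Megabytes of memory (size_mb)
  • Total uncorrectable errors (ue_count)
  • Number of uncorrectable errors with no information about which DIMM slot is affected (ue_noinfo_count)
  • Type of memory (mem_type)
  • Type of DRAM device (dev_type)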

The linked article has a simple bash script that collects information from the /sys filesystem and appends it to a CSV file, as well as some Python code for reading and processing that file.
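As a rough illustration of the collection step (a Python stand-in, not the bash script from the linked article), the sketch below reads a few EDAC counters per memory controller and appends them to a CSV file. The sysfs location /sys/devices/system/edac/mc, the particular attributes chosen, and the function names read_edac and append_csv are assumptions for illustration; check which attributes your kernel and EDAC driver actually expose.

```python
import csv
import time
from pathlib import Path

# Assumed location of the EDAC memory-controller tree in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

# Per-controller counters to record (an illustrative subset).
FIELDS = ["ce_count", "ce_noinfo_count", "ue_count", "ue_noinfo_count",
          "seconds_since_reset", "size_mb"]


def read_edac(root=EDAC_ROOT):
    """Return one dict per memory controller (mc0, mc1, ...) found under root."""
    rows = []
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        row = {"timestamp": int(time.time()), "controller": mc.name}
        for field in FIELDS:
            attr = mc / field
            # Not every kernel or driver exposes every attribute.
            row[field] = attr.read_text().strip() if attr.is_file() else ""
        rows.append(row)
    return rows


def append_csv(rows, csv_path):
    """Append rows to csv_path, writing a header only when the file is new."""
    csv_path = Path(csv_path)
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["timestamp", "controller"] + FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

From a cron job, a call such as append_csv(read_edac(), "/var/log/edac_mem.csv") would add one row per memory controller on each run; the output path here is hypothetical.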

Putting the collection script in a cron job on all nodes is a great way to record memory logs. You can use logger to write the output to the system log, or you can write to a dedicated log file and rely on rsyslog to copy it to the central logging server.
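If the collector is written in Python rather than bash, the standard syslog module can play the role of logger. The following is a minimal sketch under assumptions: the tag memlog, the LOG_LOCAL0 facility, the key=value message layout, and the helper names are all illustrative choices, not something prescribed by rsyslog or by the original script.

```python
import syslog


def format_record(host, controller, ce_count, ue_count):
    """Build a key=value line that is easy to grep or parse on the log server."""
    return f"host={host} mc={controller} ce_count={ce_count} ue_count={ue_count}"


def pick_priority(ue_count):
    """Raise the priority when uncorrectable errors have been seen."""
    return syslog.LOG_ERR if int(ue_count) > 0 else syslog.LOG_INFO


def log_record(message, priority=syslog.LOG_INFO, tag="memlog"):
    """Send message to the local syslog daemon, much as `logger -t memlog` would."""
    # LOG_LOCAL0 is an arbitrary choice; use a facility your rsyslog rules forward.
    syslog.openlog(ident=tag, facility=syslog.LOG_LOCAL0)
    syslog.syslog(priority, message)
    syslog.closelog()
```

A cron-driven script could then call log_record(format_record("n001", "mc0", 3, 0), pick_priority(0)), where the host name and counter values are placeholders, and let rsyslog forward the entry to the central server.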

Network

Grabbing network logs is a double-edged sword. Having the network information to correlate with CPU, memory, and other logs is wonderful; however, the resulting network logs can be HUGE, to say the least. In the case of TCP traffic, you would need to log the packets coming and going on all of the open ports for every node, which could be a great deal of traffic, depending on the number of open ports and what the applications are doing.

Rather than log every single packet on every network, I think it’s better to do two things:

  1. Perform periodic micro-performance tests from one node to another. This is a very simple and quick test between two or more nodes using the main application and storage networks. Ideally, you would record the 0-byte packet latency and the maximum bandwidth before the user application starts, although doing so could noticeably delay the start of the user’s application. These tests can be run easily in a prologue job script and even in an epilogue script.
  2. Log network errors. Although this sounds simple, it can be a bit complicated. The errors on the switches and the hosts should be logged in a single place. For example, on the hosts, you can gather packet “errors” using something as simple as ifconfig (which is deprecated, so you might want to use the ip command). A simple script can do this and can be put in a cron job. You can do the same for InfiniBand (IB) connections on the host.
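For the host side of the error logging described above, one option is to read the per-interface counters that the Linux kernel exposes under /sys/class/net instead of parsing ifconfig or ip output. The sketch below assumes that approach; the selected counters, the line format, and the function names are illustrative, and it does not cover InfiniBand ports or switches.

```python
import socket
import time
from pathlib import Path

# Per-interface statistics live under /sys/class/net/<iface>/statistics.
NET_ROOT = Path("/sys/class/net")
COUNTERS = ["rx_errors", "tx_errors", "rx_dropped", "tx_dropped"]


def read_net_errors(root=NET_ROOT):
    """Return {interface: {counter: value}} for every interface under root."""
    stats = {}
    for iface in sorted(Path(root).iterdir()):
        stat_dir = iface / "statistics"
        if not stat_dir.is_dir():
            continue  # skip entries that are not network interfaces
        stats[iface.name] = {
            name: int((stat_dir / name).read_text())
            for name in COUNTERS
            if (stat_dir / name).is_file()
        }
    return stats


def to_lines(stats, host=None):
    """Render one timestamped key=value line per interface."""
    host = host or socket.gethostname()
    now = int(time.time())
    lines = []
    for iface, counters in stats.items():
        fields = " ".join(f"{key}={value}" for key, value in counters.items())
        lines.append(f"{now} host={host} iface={iface} {fields}")
    return lines
```

Because these counters are cumulative since the interface came up, a cron job that records to_lines(read_net_errors()) at regular intervals lets you compute the change between samples and line it up with the CPU and memory logs.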

Grabbing switch logs to look for errors is very dependent on the switch manufacturer, so be sure to read the switch manuals and write some simple scripts to grab switch logs periodically. The same is true for IB switches.