Log Everything

Storage

Recording storage logs is very similar to recording network logs. Capturing every data packet that moves between the compute nodes, the storage servers, and the drives produces a large amount of information, most of which is useless to you. Instead, think about running simple I/O tests in a job’s prologue and epilogue scripts and recording the results. Of course, the numbers will vary with the I/O load on the system, but it’s worth knowing what I/O performance looks like just before a job starts and just after it finishes.
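
As a starting point, a prologue or epilogue script could run a small write/read test against the job’s scratch area and append the result to a log. The sketch below is a minimal Python example; the test file path, test size, and log location are placeholders you would adjust for your site, and the read pass may be served from the page cache rather than the storage itself, so treat the numbers as a rough health check.

    #!/usr/bin/env python3
    # Minimal I/O check for a job prologue/epilogue script.
    # TEST_FILE and LOG_FILE are hypothetical paths -- adjust for your site.
    import os
    import socket
    import time

    TEST_FILE = "/scratch/iocheck.tmp"      # hypothetical scratch path
    LOG_FILE = "/var/log/io_prologue.log"   # hypothetical log location
    BLOCK = b"\0" * (1 << 20)               # 1 MiB per write
    NBLOCKS = 256                           # 256 MiB total

    def timed_write():
        start = time.time()
        with open(TEST_FILE, "wb") as f:
            for _ in range(NBLOCKS):
                f.write(BLOCK)
            f.flush()
            os.fsync(f.fileno())            # push the data to the storage system
        return NBLOCKS / (time.time() - start)   # MiB/s

    def timed_read():
        start = time.time()
        with open(TEST_FILE, "rb") as f:
            while f.read(1 << 20):
                pass
        return NBLOCKS / (time.time() - start)   # MiB/s (may be served from cache)

    if __name__ == "__main__":
        write_rate = timed_write()
        read_rate = timed_read()
        os.remove(TEST_FILE)
        # Time stamp and node name on every entry.
        with open(LOG_FILE, "a") as log:
            log.write("%s %s write=%.1f MiB/s read=%.1f MiB/s\n" %
                      (time.strftime("%Y-%m-%dT%H:%M:%S"),
                       socket.gethostname(), write_rate, read_rate))

Because every entry carries a time stamp and a node name, you can later line up slow jobs with slow I/O.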

In addition to running your own I/O tests, you can grab I/O statistics from the storage servers and clients. A simple example is NFS. A great tool, nfsiostat, lets you capture per-mount statistics about NFS client activity. On the client side, you can grab information such as:

  • Number of blocks read or written
  • Number of reads and writes (ops/sec)

You can also gather the same counters for I/O performed with O_DIRECT. Combined with server-side statistics, this information lets you build a histogram of NFS performance for both clients and servers.
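
One low-effort way to collect these numbers is to let nfsiostat report at a fixed interval and tag each output line with a time stamp and node name before it goes into a log. The Python sketch below assumes a hypothetical mount point, interval, and log path; adjust them for your environment.

    #!/usr/bin/env python3
    # Sketch: tag each line of continuous nfsiostat output with a time stamp
    # and node name, then append it to a log. The mount point, interval, and
    # log path are placeholders -- adjust for your site.
    import socket
    import subprocess
    import time

    MOUNT = "/home"                        # hypothetical NFS mount point
    LOG_FILE = "/var/log/nfsiostat.log"    # hypothetical log location
    INTERVAL = "60"                        # report every 60 seconds

    node = socket.gethostname()
    # nfsiostat <interval> <mount_point> keeps printing reports until killed.
    proc = subprocess.Popen(["nfsiostat", INTERVAL, MOUNT],
                            stdout=subprocess.PIPE, text=True)
    with open(LOG_FILE, "a") as log:
        for line in proc.stdout:
            if line.strip():
                stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
                log.write(f"{stamp} {node} {line}")
                log.flush()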

In addition to nfsiostat, you can use iostat, which collects a wide range of metrics on the storage server, such as CPU utilization, throughput, and I/O request times. You can also run iostat on the client nodes to monitor their local I/O.
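
If you want the iostat data in a form that is easy to histogram later, you can take periodic extended samples and write selected per-device fields as CSV. The sketch below is one way to do that in Python; the metrics chosen, the sampling interval, and the log path are assumptions, and the header line is parsed by name because column order differs between sysstat versions.

    #!/usr/bin/env python3
    # Sketch: take one extended iostat sample and log selected per-device
    # metrics as CSV. The metric names, interval, and log path are assumptions.
    import csv
    import socket
    import subprocess
    import time

    LOG_FILE = "/var/log/iostat_devices.csv"   # hypothetical log location
    WANTED = ("r/s", "w/s", "rkB/s", "wkB/s", "%util")

    # "iostat -dxk 5 2": the second report covers the last 5 seconds.
    out = subprocess.run(["iostat", "-dxk", "5", "2"],
                         capture_output=True, text=True).stdout
    reports = out.strip().split("\n\n")
    lines = reports[-1].splitlines()           # last per-device report
    header = lines[0].split()
    cols = [header.index(c) for c in WANTED if c in header]

    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    node = socket.gethostname()
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        for line in lines[1:]:
            fields = line.split()
            writer.writerow([stamp, node, fields[0]] +
                            [fields[i] for i in cols])

Running something like this from cron on both the storage servers and the client nodes gives you a matching set of records to compare.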

Likely, you are already using filesystem tools, so it’s easy to look for errors in the filesystem logs and collect them (script this, as in the sketch below). These logs are specific to each filesystem, so be sure to read the manuals on what is being recorded.
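
A minimal version of that script might just scan the system log for filesystem-related error strings and copy any hits, tagged with the node name, into one collection file. The paths and keywords below are illustrative only; the strings each filesystem emits are documented in its own manuals, which is why reading them matters.

    #!/usr/bin/env python3
    # Sketch: pull filesystem-related error messages out of the system log and
    # collect them in one place. The log path and keyword list are assumptions;
    # tune them to the filesystems you actually run.
    import socket
    import time

    SYSLOG = "/var/log/messages"               # or /var/log/syslog on some distros
    OUT = "/var/log/fs_errors_collected.log"   # hypothetical collection file
    KEYWORDS = ("EXT4-fs error", "I/O error",
                "Remounting filesystem read-only")   # illustrative keywords

    node = socket.gethostname()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(SYSLOG, errors="replace") as src, open(OUT, "a") as dst:
        for line in src:
            if any(k in line for k in KEYWORDS):
                dst.write(f"{stamp} {node} {line}")

A production version would also remember how far it had already read, so repeated runs don’t collect the same lines twice.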

Summary

Many system administrators are reluctant to log much more than the minimum needed for compliance. However, I’m a big believer that having too much information is better than not having enough. More logging means more space used and probably more network traffic, but in the end, you have a set of system logs that you can use to your advantage.

To review, here are four highlights:

  • Log everything (within reason).
  • Put a time stamp on it.
  • Put a node name on every entry.
  • Be a lumberjack, and you’ll be OK.