Log analysis in high-performance computing

State of the Cluster

Gathering logs from distributed systems for manual searching is a typical task performed in high-performance computing (HPC) [1]. Log analysis is important for cybersecurity, understanding HPC cluster behavior, and event and trend analysis. In this article, I address the state of the art in log analysis and how it can be applied to HPC.

Origins

Log analysis can produce information through a variety of functions and technologies, including:

ingestion
centralization
normalization
classification and logging
pattern recognition
correlation analysis
monitoring and alerts
artificial ignorance
reporting

Logs are great for checking the health of a set of systems and can be used to locate obvious problems, such as kernel modules not loading. They can also be used to find attempts to break into systems through various means, including shared credentials. However, these examples do not really take advantage of all the information contained in logs: Log analysis can be used to improve system administration skills.

When analyzing or just watching logs over a period of time, you can get a feel for the rhythm of your systems; for example: When do people log in and out of the system? What kernel modules are loaded? What, if any, errors occur (and when)? The answers to these questions allow you to recognize when things don't seem quite right with the systems (events) that "normal" log analysis might miss. A great question is: Why does user X have a new version of an application? Normal log analysis would not care about this query, but perhaps the user needed a new version and could indicate that others might also need the newer version, prompting you to build and make it available to all.

Developing an intuition of how a system

...

Use Express-Checkout link below to read the full article (PDF).