
Monitoring HPC Systems

Nerve Center

When you know better, you do better – Maya Angelou

Monitoring a cluster and understanding how it is performing are key to helping users run their applications better and to optimizing the use of cluster resources.

Such information is valuable for a variety of reasons: It shows how the cluster is being used, how much of the processing capability and memory user applications consume, and what the network is doing and whether applications are using it. With this information, you can see where to change the configuration of the current cluster to improve resource utilization, and you can plan for the next cluster.

In a past blog post [1], I looked at monitoring from the perspective of understanding what is happening in the system (the metrics) and how important it can be to understand the frequency at which you sample those metrics.

If you put several cluster admins in a room together (e.g., the BeoBash [2]), and you ask, "What is the best way to monitor a cluster?" you will have to duck and cover pretty quickly from the huge number of opinions and the great passion behind the answers. Having so many options and opinions is not a bad thing, but you need to sort through the ideas to find something that works for you and your situation.

In two further blog posts [3] [4], I wrote some simple scripts to measure metrics on a single server as a starting point for use in a cluster. This code measured the processes of interest by collecting data one node at a time.
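To give a feel for what a single-node measurement can look like, here is a minimal Python sketch. It is not the code from the posts cited above; it assumes a Linux node that exposes /proc/stat and /proc/meminfo (with the MemAvailable field), and every function name in it is hypothetical.

```python
import time


def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted in user, so it is dropped.
    return [int(v) for v in fields[1:9]]


def cpu_busy_fraction(prev, curr):
    """Fraction of time the CPUs were not idle between two samples."""
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + deltas[4]  # idle + iowait
    return 1.0 - idle / total


def parse_meminfo(text):
    """Map /proc/meminfo keys to their numeric values (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_used_fraction(info):
    """Fraction of memory not available to new applications."""
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def sample_node(interval=1.0):
    """Take one CPU and memory sample on the local node (Linux only)."""
    with open("/proc/stat") as f:
        prev = parse_cpu_line(f.readline())
    time.sleep(interval)  # the sampling interval is an assumption
    with open("/proc/stat") as f:
        curr = parse_cpu_line(f.readline())
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    return {
        "cpu_busy": cpu_busy_fraction(prev, curr),
        "mem_used": memory_used_fraction(mem),
    }


if __name__ == "__main__":
    print(sample_node())
```

The parsing is kept separate from the file reads so that the arithmetic can be checked against saved samples, and so that the same functions could later be fed data gathered from many nodes.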

Now it's time to look at monitoring frameworks where, I hope, the scripts will be useful for custom monitoring and

...
