Lead Image © Peter Hermes Furian, 123RF.com

Measuring the performance health of system nodes

Peak Performance

Article from ADMIN 70/2022

By Jeff Layton

Many HPC systems check the state of a node before running an application, but not very many check that the performance of the node is acceptable before running the job.

In a previous article [1] I discussed prolog and epilog scripts with respect to a resource manager (job scheduler). Most of the prolog examples were simple and focused on setting up the environment before running a job, whereas the epilog example cleaned up after an application or workflow was executed. However, prolog and epilog scripts are not limited to these aspects. One aspect of prolog/epilog scripting that I didn't touch on was checking the health of the nodes assigned to the job.

Generically, you can think of a node health check as determining whether a node is configured as it should be (i.e., setting the environment as needed, which I discussed somewhat in the previous article) and is running as expected. This process includes checking that filesystems are mounted correctly, needed daemons are running, the amount of memory is correct, networking is up, and so on. I refer to this as the "state" of the node's health.

In that same article, I mentioned Node Health Check (NHC) [2], which is used by several sites to check the health of nodes, hence the name. In my mind, it focuses on checking the "state" of the node, which is a very important part of the health of a node. A large number of options can be turned on and off according to what you want to check about the state of the node.

Almost 20 years ago, when I worked for a Linux high-performance computing (HPC) company that no longer exists, we had a customer who really emphasized the number of nodes that were available for running jobs at any one time. One of the ways we measured this was to run a short application and check the performance against the other nodes. If the performance was up to or close to that of the other nodes, the node was considered "up" and available for users to run jobs. Otherwise, the node was considered down and not used to

...

Use one of the options below to read the full article