Lead Image © Peter Hermes Furian, 123RF.com

Measuring the performance health of system nodes

Peak Performance

Article from ADMIN 70/2022
Many HPC systems check the state of a node before running an application, but few check that the performance of the node is acceptable before running the job.

In a previous article [1] I discussed prolog and epilog scripts with respect to a resource manager (job scheduler). Most of the prolog examples were simple and focused on setting up the environment before running a job, whereas the epilog example cleaned up after an application or workflow was executed. However, prolog and epilog scripts are not limited to these aspects. One aspect of prolog/epilog scripting that I didn't touch on was checking the health of the nodes assigned to the job.

Generically, you can think of a node health check as determining whether a node is configured as it should be (i.e., setting the environment as needed, which I discussed somewhat in the previous article) and is running as expected. This process includes checking that filesystems are mounted correctly, needed daemons are running, the amount of memory is correct, networking is up, and so on. I refer to this as the "state" of the node's health.
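As a rough illustration of a "state" check, the following Python sketch tests two of the items listed above: that required filesystems are mounted and that the node reports at least the expected amount of memory. It is not taken from NHC or from the previous article. The function names, the required mount points, and the memory threshold are all hypothetical, and the parsers assume the usual Linux layout of /proc/mounts and /proc/meminfo.

```python
def parse_mounts(mounts_text):
    """Return the set of mount points found in /proc/mounts-style text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: device mountpoint fstype options dump pass
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def parse_mem_total_kib(meminfo_text):
    """Return the MemTotal value (in kB as reported) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def check_state(mounts_text, meminfo_text, required_mounts, min_mem_kib):
    """Return a list of failure messages; an empty list means the state looks fine."""
    failures = []
    mounted = parse_mounts(mounts_text)
    for mount_point in required_mounts:
        if mount_point not in mounted:
            failures.append(f"missing mount: {mount_point}")
    mem_kib = parse_mem_total_kib(meminfo_text)
    if mem_kib < min_mem_kib:
        failures.append(f"memory too low: {mem_kib} < {min_mem_kib}")
    return failures
```

In a prolog script, you might read the two /proc files, call check_state() with the values your site expects, and exit with a nonzero status if the returned list is not empty. Checks for daemons and networking would follow the same pattern.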

In that same article, I mentioned Node Health Check (NHC) [2], which several sites use to check the health of nodes, hence the name. In my mind, it focuses on checking the "state" of the node, which is a very important part of the health of a node. NHC has a large number of options that can be turned on and off according to what you want to check about the state of the node.

Almost 20 years ago, when I worked for a Linux high-performance computing (HPC) company that no longer exists, we had a customer who really emphasized the number of nodes that were available for running jobs at any one time. One of the ways we measured this was to run a short application and check the performance against the other nodes. If the performance was up to or close to that of the other nodes, the node was considered "up" and available for users to run jobs. Otherwise, the node was considered down and not used to
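The idea described here, running a short application and comparing its performance against the other nodes, could be sketched in Python along the following lines. This is only an illustration under stated assumptions, not the tool that company used: the function names are hypothetical, the sample workload is a stand-in for a real short benchmark, and the 10 percent tolerance is an arbitrary value you would tune for your own system.

```python
import statistics
import time


def sample_workload(n=200_000):
    """Stand-in for a short benchmark application (hypothetical)."""
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total


def time_workload(workload, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best


def is_performance_ok(node_seconds, peer_seconds, tolerance=0.10):
    """True if this node's run time is within tolerance of the peer median."""
    reference = statistics.median(peer_seconds)
    return node_seconds <= reference * (1.0 + tolerance)
```

A prolog script might call time_workload(sample_workload) on the node, compare the result with run times recorded earlier on known-good nodes, and mark the node down if is_performance_ok() returns False. Using the median of the peers keeps a single slow peer from skewing the reference.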

Related content

  • Performance Health Check

    Many HPC systems check the state of a node before running an application, but not very many check that the performance of the node is acceptable before running the job.

  • Benchmarks Don’t Have to Be Evil

    Benchmarks have been misused by both users and vendors for many years, but they don’t have to be the evil creature we all think them to be.

  • Using benchmarks to your advantage

    A collection of single- and multinode performance benchmarks is an excellent place to start when debugging a user's application that isn't running well.

  • Prolog and Epilog Scripts

    HPC systems can benefit from administrator-defined prolog and epilog scripts.

  • ClusterHAT

    Inexpensive, small, portable, low-power clusters are fantastic for many HPC applications. One of the coolest small clusters is the ClusterHAT for Raspberry Pi.
