
HPC resource monitoring for users

Close Companion

Article from ADMIN 62/2021

By Jeff Layton

Remora provides per-node and per-job resource utilization data that can be used to understand how an application performs on the system through a combination of profiling and system monitoring.

In a recent conversation with colleagues, the topic turned to applications that return unusual results even though previous runs had produced the expected ones. As we discussed ways to tell whether an application was running correctly, I thought it would be great to get a snapshot of what was happening on all of the nodes involved in a job. I like to think of this as "application telemetry."

A simple search on "telemetry" [1] brings up a definition like "… the process of recording and transmitting the readings of an instrument." In this case, the instrument is the high-performance computing (HPC) system, and the readings are resource aspects of the system (e.g., CPU, memory, network, and storage usage). Here, the telemetry is used to help the user understand what their application was doing during execution. The keyword in that last sentence is user. The point is that users should have access to the resource usage of their own applications so they can spot problems, which prompted me to revisit Remora.

Remora: REsource MOnitoring for Remote Applications [2], from the Texas Advanced Computing Center (TACC) at the University of Texas at Austin, combines monitoring and profiling to provide information about your application. It is not strictly a profiler, and it is not a monitoring tool in the traditional sense of watching the entire cluster. Instead, Remora provides per-node and per-job resource utilization data that can be used to understand how an application performs on the system. The user (not just the admin) can go back and examine what was happening on the nodes while a job was running.
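To make the idea of per-node telemetry concrete, the following is a minimal Python sketch of recording timestamped memory readings on a single node. It is not how Remora is implemented; the function names (parse_meminfo, record, read_node_memory) are hypothetical, and the reader assumes a Linux node that exposes /proc/meminfo.

```python
import time
from typing import Callable, Dict, List


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most values are reported in kB; a few fields (e.g., HugePages_Total)
    are plain counts.
    """
    readings: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name: <integer> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            readings[name.strip()] = int(parts[0])
    return readings


def record(read: Callable[[], Dict[str, int]],
           samples: int,
           period_s: float) -> List[Dict[str, float]]:
    """Take `samples` timestamped readings, `period_s` seconds apart."""
    log: List[Dict[str, float]] = []
    for i in range(samples):
        row: Dict[str, float] = {"time": time.time()}
        row.update(read())
        log.append(row)
        if i < samples - 1:  # no need to wait after the last sample
            time.sleep(period_s)
    return log


def read_node_memory() -> Dict[str, int]:
    """Read the current node's memory counters (assumes Linux)."""
    with open("/proc/meminfo") as f:
        return parse_meminfo(f.read())
```

On a Linux node, record(read_node_memory, samples=3, period_s=1.0) would return three timestamped snapshots that can be examined after the run. Passing the reader in as a function keeps the sampling loop independent of any one data source, so the same loop could record CPU or network readings as well.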

The goal of Remora is simplicity, which is achieved by using commonly installed tools that focus on the user, putting data and possibly information in the user's hands (and probably the admin's if an issue crops up). The data can also be used

...
