The Logic of Helping Users
There really is logic behind writing HPC articles.
Have you ever said something, and other people look at you like you’re crazy? My wife does this fairly often – for good reasons. In my mind the logic behind what I said is perfectly good, but I just failed to vocalize it. What I say sounds like nonsense, but in my brain it was very logical. Typically, the response my wife gives is a hearty laugh, sometimes accompanied by a shaking of the head in disbelief. But my wife, being the amazing person she is (once she stops laughing) asks about the logic behind my gibberish, and I explain it. Then she understands but still laughs at me.
To introduce this article, I thought I would give you some insight into how I arrived at this subject. I hope the logic is sound, but if not, you get a glimpse into my thought process, which may or may not send you screaming.
I originally started this article by using those fancy new AI tools everyone has been talking about. In particular, I was looking for a list of possible problems or issues that HPC administrators face, so I tried Google Search. I asked the question and got an answer in return. (From the little checking I did, I believe Google Search uses the Gemini model.) In my experience, the "AI Answer" at the top of the results can be interesting, but it often doesn't answer the question or give me what I want. However, I had to start somewhere, so Google AI search results it was.
I tried several ways to ask the same basic question and recorded the answers. No matter the question, they were all roughly the same, but the final question I asked Google was, “What are the top questions from HPC administrators?”
The AI Answer I received had three main areas:
- Strategic and operational questions
- Technical and management questions
- Job and security questions
Under each of these points were two or three sub-questions going into more detail.
My goal was to write about the responses, not answer them; rather, I wanted to try to judge whether they were logical, accurate, and appropriate. Honestly, after writing a bit, I just couldn’t find much interesting to say, so I dropped that article.
I didn’t give up hope, and in about a week I decided to revisit the question. I asked almost the same question but in a different format. I also tried changing the word order and using synonyms. I ended with the question, “What are the current top issues for HPC administrators?” The answers were slightly different:
- Managing infrastructure complexity
- Cost control and resource optimization
- Navigating the hybrid cloud
- Addressing security threats
- Dealing with a talent and skills shortage
- Managing data and I/O
Again, each of the issues included more detailed topics.
These issues were a bit more technical than the previous questions, with some higher-level concerns such as the talent and skills shortage and security threats. The topics looked reasonable, and again I set off to write about the resulting answers, but this time I also wanted to try to address some of the questions. Again, I started the article, and I wasn’t happy with the results.
I then tried to answer some of the questions or issues, but I got bogged down in the details. The article also got increasingly large, even after breaking it into pieces. Once more, I stopped writing and dropped this article (0 for 2).
I took one final swing at having AI tell me what it thought the important issues were in HPC. This time I tried a different AI, not the one in Google search, but one that had a better reputation. I got some answers and started researching them. Guess what? The responses were taken directly from an HPC ADMIN article by Doug Eadline. The first was titled “Top Three HPC Roadblocks.” Now that I knew Doug had already answered the questions, there was nothing for me to do. By the way, Doug’s article is timeless, and you should read it or re-read it.
I was getting a bit burned out, so I took a break for about a week, then came back and tried a fourth approach: taking all of the AI responses, as well as what I had written about each, and looking for common topics or themes. It took a bit of work and time, but I distilled it to one topic that I thought was worth writing about: how to integrate HPC systems into an Enterprise world. I narrowed the focus of this theme down to, “How do you monitor and control HPC systems with Enterprise tools?” Again, I thought this would be a good topic, so I set off to write.
This is an important issue, particularly because commercial entities are adopting AI as part of their products or services, which means HPC or HPC-like systems will now be part of the Enterprise. Because central IT was there first, so to speak, Enterprise-grade monitoring tools would be a requirement.
I soon discovered several tools are being used in the Enterprise world for monitoring systems. Almost all of them were commercial, and I learned that each one had its own set of prerequisites and requirements. They also seemed fairly “heavy” in their footprint or load on a system. Moreover, most of them could give you a quick glance at the overall cluster, but it wasn’t what I would call HPC-like.
Not to start telling war stories, but when I was an administrator at Lockheed Martin, central IT took over all the HPC systems and brought them into the IT fold. They put an enterprise daemon on each node of every HPC system. Then they wanted to connect the output from each node to a single location where there would be a huge display and each node would have a red or green light on the display indicating its status. The price to create this system was out of sight because it required the commercial vendor(s) to create custom software. The price was more than any single HPC system. Of course, it never happened, but it does illustrate the potential gulf between central IT and HPC.
Because each of the monitoring tools had its own requirements, it would be difficult to write about them in a generic sense. I don’t do product reviews, and I didn’t want to write eight-plus individual articles. Moreover, I’m sure these companies know how to integrate their tools into HPC, so they don’t need my help.
This was the fifth article I had started and stopped. Fortunately, I hadn’t written too much on this topic, but the deadline was fast approaching. I went back to the previous attempts and looked again for a common thread. At this point I came back to a final issue that popped up in the AI output: helping users debug, run, and understand their applications. I liked this topic because it goes back to the users. This is something I could write about.
Now you know a little more about how I came up with ideas for topics. The journey has been a bit strange, even in my mind, but here I am, and I’m ready to jump in.
Helping Users Understand
In my system administration career, I’ve found that once you get past the installation of a system, which is definitely not trivial, you move into the operational phase, where helping users becomes the number one topic. This phase centers on their applications: getting the needed tools in place, building the applications, running them, and – what I think is most important – helping users understand how their applications execute and how they can make them better (however you want to define “better”). I can’t write anything useful that applies to the full range of applications, especially about debugging, but I can discuss some tools and ideas for understanding applications by showing how to use monitoring tools.
I want to approach this topic from two directions: the user perspective and the admin perspective. Let me explain these in a bit more detail.
Teach an HPC User How to Fish
I have always liked the proverb “Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime.” The origins of this proverb are not quite clear, but it is thought to be from Chinese thinkers (philosophers). Regardless, the intent of the proverb is to teach a person to solve their problems rather than give them the answers because, from then on, they can solve their problems on their own. Moreover, I like to think that they can teach others how to solve their problems and so on. We all become teachers and we all benefit.
Toward this goal, I like to put tools in the hands of users so they can understand how their code executes and see possible places for improvement. Users can run these tools any time without involving system administrators, which gives them much more flexibility. However, user tools don’t run with elevated privileges, so some information is out of reach. That is not a showstopper, but it does narrow what can be done, and it is still a great place to start fishing. One tool that I think is very useful in this regard is Remora from the Texas Advanced Computing Center (TACC).
I won’t be discussing Remora in detail in this article. That will be in a subsequent article. (I hope the next one, but SC25 is coming up.) I have written about it in the past.
Remora (resource monitoring for remote applications) can be used by any user when running their applications. In the next article, I will go over the steps to install it, use it, and interpret what it produces. Here, I’ll discuss just a little to whet your appetite.
You don’t have to do anything special with your application(s) when using Remora. You don’t have to compile in a special library to make it work, which means you can use it with commercial applications where you only get the binaries or with applications that your HPC center has built for you. Simply, when you run your code, you just preface it with the remora command, for example:
$ remora ./myapp.exe
or
$ remora mpirun [...] ./my_mpiapp.exe
Remora then captures lots of information and puts it in a subdirectory where you can postprocess and create a report.
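You can also tune how Remora collects data through environment variables set before you launch. As a hedged sketch (the variable names and defaults here are from my reading of the Remora documentation; verify them against the README for your installed version), you might shorten the sampling interval and turn on verbose output like this:

```shell
# Hypothetical tuning example; REMORA_PERIOD and REMORA_VERBOSE are
# described in the Remora README (check your installed version).
export REMORA_PERIOD=5      # seconds between data snapshots (default 10)
export REMORA_VERBOSE=1     # print extra information while collecting

# Then run your application as usual:
# remora ./myapp.exe
```

The nice part is that none of this changes your application or your job script beyond a couple of extra lines.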
No Remora-specific applications are used to gather information about the run. Rather, existing tools are used, along with information parsed from the /proc filesystem and other sources standard to Linux. A partial list of the data Remora collects includes:
- Memory usage, including CPU, Xeon Phi, and NVIDIA GPU memory
- CPU utilization
- I/O usage – Lustre, data virtualization service (DVS)
- Nonuniform memory access (NUMA) properties
- Network topology
- Message passing interface (MPI) communication statistics
- Power consumption
- CPU temperatures
- Detailed application timing
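To give a feel for what “parsing /proc” looks like in practice, here is a minimal Python sketch (illustrative only, not Remora code) that samples memory usage and the aggregate CPU counters from standard Linux interfaces:

```python
# Illustrative /proc sampling, in the spirit of tools like Remora.
# This is not Remora code, just a sketch of the general technique.

def read_meminfo():
    """Return a dict of fields from /proc/meminfo (values in kB)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])
    return info

def read_cpu_ticks():
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()
    # fields[0] is the label "cpu"; the rest are jiffies:
    # user, nice, system, idle, iowait, irq, softirq, ...
    return [int(x) for x in fields[1:]]

if __name__ == "__main__":
    mem = read_meminfo()
    used_kb = mem["MemTotal"] - mem["MemAvailable"]
    print(f"Memory used: {used_kb / 1024:.0f} MiB "
          f"of {mem['MemTotal'] / 1024:.0f} MiB")

    ticks = read_cpu_ticks()
    busy = sum(ticks) - ticks[3]  # everything except idle time
    print(f"CPU busy fraction since boot: {busy / sum(ticks):.1%}")
```

A tool like Remora runs samplers of this kind on every node at a regular interval and writes the results out for postprocessing, so the overhead on the application stays small.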
A few years ago, the esteemed Kent Milfeld of TACC, the primary developer of Remora, retired, so I thought Remora development had stopped. However, if you look at the GitHub repository, Kent has been quietly working on the code. He made a 2.0.0 release on November 18, 2023, so some updates have come since my previous article. The next article about Remora should be very interesting.
El Rooto Monitoring
Remora covers userspace monitoring, but some things are beyond its reach. For example, it won’t capture system log entries or what other users were doing on the system while your code ran. That information can be very important in some cases, which means a system administrator or someone with elevated privileges has to get involved.
After writing about Remora, I will write about tools and techniques for gathering information that only elevated privileges can gather.
Summary
This article was just a peek into my mind and how I come up with article ideas. This one proved to be a bit more difficult for some reason, so I thought I would show you my thought process. Perhaps a piece of me also wanted to show that writing is not just a simple matter of running a few commands and writing something about them. It takes work, planning, and time for these articles to come to fruition.
This first attempt with AI to come up with articles taught me I need to learn a bit more about how to use it. However, I don’t plan to use AI to write any article, although I may experiment with using AI to help edit, because it can help with wording, grammar, and finding issues.
Although I struggled, I think the resulting two articles on how monitoring can help users understand how their applications use system resources will be useful.
In the meantime, enjoy SC25. Stay hydrated and get some sleep.