What to Do with System Data: Think Like a Vegan

What do you do with all of the HPC data you harvested as a lumberjack? You think like a Vegan.

Ellie Arroway: You found the primer.
S. R. Hadden: Clever girl! Lights. ... Pages and pages of data. Over 63 thousand in all, and on the perimeter of each ...
Ellie Arroway: ... alignment symbols, registration marks, but they don't line up.
S. R. Hadden: They do, if you think like a Vegan. An alien intelligence is going to be more advanced. That means efficiency functioning on multiple levels and in multiple dimensions.
Ellie Arroway: Yes! Of course. Where is the primer?
S. R. Hadden: You'll see. Every three-dimensional page contains a piece of the primer; there it was all the time, staring you in the face. Buried within the message itself, is the key ...

- Contact (1997)

It's not enough to be a lumberjack

Based on the last article, everyone should either be a lumberjack or be planning to become one. You should have developed a logging plan and have lots of data about your cluster. Now, we need to become vegans, too (yes, vegan lumberjacks).

Vegan sysadmins think differently than ordinary sysadmins. They look at system administration on "multiple levels and in multiple dimensions." Multiple levels means they think from a user's point of view, an operations point of view, and a management or funding point of view (without bucks, there is no Buck Rogers). Multiple dimensions means thinking of the system from a CPU perspective, a memory perspective, a network perspective, a storage perspective, and, perhaps most importantly of all, an application perspective. Multiple levels and multiple dimensions go well beyond the classic definition of system administration, which means we need to think like vegans. If we do, we can achieve much more than just break-fix work or installing new versions of applications.

You're a crack lumberjack, and you have all sorts of data (metrics) about the nodes in the cluster; now what do you do? Simply, but vaguely, put: you want to parse the logs and search through the data to create information. That sounds simple, but it's not, and it is probably the most important thing you will do as a sysadmin. The very first thing you want to do is think like a user (use your vegan powers).

The systems exist to run user jobs; without those jobs, the systems are pointless. Now we, the sysadmins, need to think like a user (think like a vegan). Thinking like a user will make the system better and will provide you with information that describes the state of the system. Some of the key things for users are:

  • The job executes to completion
  • The job executes in an expected amount of time
  • If repeated, the job produces the same answers in about the same amount of time
  • When a job is run, a summary of the resources used is given to the user
  • Applications and updates, including security updates, are quickly installed and made available
  • Tools for the creation, testing, profiling, and deployment of applications are made available and are up to date
  • There is enough space for retaining data

So we're thinking like a vegan in what we want, but to get there, we need to think like sysadmins (thinking on multiple levels). This means that whatever tools or policies we create and implement should be designed from the administrator's perspective. Clearly, one thing that falls into this category is not making tools or policies for a single user (unless that user dominates the system); otherwise, you end up with a hodgepodge of tools and policies that are difficult to track, update, and use.

We also need to think of ways to report on the state of the system as a whole. Management will want to have a quick summary view of the state of the system. This also includes funding agencies or people who have committed the funding for the system (again, we're thinking on a different level). Questions pop up such as:

  • How is the system functioning? (always a generic question)
  • What is the backlog on the system?
  • Any problems or issues with the system?

The life of a system is about three to five years, so, as sysadmins, we always need to be thinking of the next system (the workload never goes away). Having information on system utilization, backlogs, and so on can help us create a justification for the new system.

User Needs

One of the best examples of thinking like a "user" vegan is to look at Remora. This great tool allows a user to get a high-level view of the resources they used when their application ran, and it works with MPI applications. Remora collects several streams of information:

  • Memory usage (CPUs, Xeon Phi, and Nvidia GPUs)
  • CPU utilization
  • I/O usage (Lustre, DVS)
  • NUMA properties
  • Network topology
  • MPI communication statistics
  • Power consumption
  • CPU temperatures
  • Detailed application timing

In addition to this information, users want to see how long their application ran and possibly any error messages from the job scheduler.

Because you are a lumberjack, you have recorded these streams of data, and you've also recorded the job scheduler information; therefore, it's not difficult to mine a database or search for data based on the user's job. Recall that we are thinking like a user, so let's begin with the job number from the scheduler. From this number, you know the times the job started and ended, as well as which nodes were used for the job. With this information, you can gather all of these metrics over the duration of the run, as in the sketch below.
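
The article doesn't mandate a particular scheduler, so as a minimal sketch, assume Slurm: its sacct accounting command can report a job's start time, end time, and node list, which is all you need to slice your harvested metrics.

```python
import subprocess

def job_window(job_id):
    """Return the start time, end time, and node list for a job,
    taken from Slurm's accounting records via sacct."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "-X", "-n", "-P",
         "--format=Start,End,NodeList"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    start, end, nodelist = out.split("|")
    return start, end, nodelist

# With the start time, end time, and node list in hand, you can pull
# the matching slices of your harvested metrics (memory, CPU, I/O,
# and so on) for exactly those nodes over exactly that window.
start, end, nodes = job_window(123456)  # 123456 is a made-up job number
print(f"Job ran from {start} to {end} on {nodes}")
```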

Having this information handy is great when a user has a question about their job. With a job number, you can pull the report and get a quick look at the behavior of the user's application from the perspective of the user (pretty snazzy). Now we're thinking like a "user vegan".

We can also think like a "sysadmin vegan" and use all of this data to capture a glimpse of what users are doing. You can compute simple statistical measurements, such as the average, minimum, maximum, range, and standard deviation for a group of jobs; perhaps you can do this for each day, as in the sketch below. Then you can create a histogram of the average job length throughout the year (and the same for the range and other statistics).
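
A minimal sketch of those daily statistics, using only the Python standard library; the job_log structure and its sample entries are made up for illustration:

```python
import statistics
from collections import defaultdict

def daily_job_stats(job_log):
    """job_log is a list of (date, runtime_seconds) tuples pulled
    from the scheduler logs; return per-day summary statistics."""
    by_day = defaultdict(list)
    for date, runtime in job_log:
        by_day[date].append(runtime)

    summary = {}
    for date, runtimes in sorted(by_day.items()):
        summary[date] = {
            "mean": statistics.mean(runtimes),
            "min": min(runtimes),
            "max": max(runtimes),
            "range": max(runtimes) - min(runtimes),
            "stdev": statistics.stdev(runtimes) if len(runtimes) > 1 else 0.0,
        }
    return summary

# Example with made-up entries: two days of job run times (seconds).
log = [("2017-03-01", 3600), ("2017-03-01", 5400), ("2017-03-02", 120)]
for date, s in daily_job_stats(log).items():
    print(date, s)
```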

Again, thinking like a sysadmin vegan, you could perform a "cluster" analysis of the jobs (sketched below). Do jobs cluster around certain node counts? Are there groups based on application run time and number of nodes? You can probably think of other statistics or metrics that interest you or are more applicable to your situation. The point is that you have tons of data from being a lumberjack, and being a vegan allows you to think like a user and develop statistics around user behavior.
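
To make the cluster analysis concrete, here is a small sketch using scikit-learn's KMeans on two features per job, nodes requested and run time; the job data and the choice of two clusters are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per job: [nodes_requested, runtime_hours] (made-up values).
jobs = np.array([[1, 0.5], [1, 0.7], [2, 0.6],
                 [16, 12.0], [16, 11.5], [32, 10.0]])

# Standardize the features so node count doesn't dominate run time.
scaled = (jobs - jobs.mean(axis=0)) / jobs.std(axis=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
for label in np.unique(kmeans.labels_):
    members = jobs[kmeans.labels_ == label]
    print(f"cluster {label}: {len(members)} jobs, "
          f"mean nodes {members[:, 0].mean():.1f}, "
          f"mean run time {members[:, 1].mean():.1f} hours")
```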

Sysadmin Needs

Sometimes we need to think a little more like a sysadmin vegan. A good way to start is by focusing on the questions that interest the people who are funding the system or who have spoken up in support of it.

High-Level Stats

The high-level questions that typically get asked concern utilization of the system and whether it has had an impact on the users' ability to accomplish their work. Examples of useful information (one of which is sketched after the list) are:

  • Time history of system utilization over the year (just a simple number that indicates how much the system was used)
  • Histogram of job backlog
  • Histogram of job length (how long did the jobs run?)
  • Histogram of the number of nodes or cores requested per job
  • Histogram of storage utilization (storage usage as a function of time)
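
As one concrete example from this list, a histogram of job length takes only a few lines of matplotlib; the run times below are made-up stand-ins for what you would pull from the scheduler's accounting logs:

```python
import matplotlib.pyplot as plt

# Job run times in hours for the reporting period (made-up values).
runtimes_hours = [0.1, 0.5, 0.5, 1.0, 2.0, 2.5, 4.0, 8.0, 12.0, 24.0]

plt.hist(runtimes_hours, bins=20)
plt.xlabel("Job run time (hours)")
plt.ylabel("Number of jobs")
plt.title("Histogram of job length")
plt.savefig("job_length_histogram.png", dpi=150)
```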

You can probably think of other metrics that apply to your situation, but don't forget that people higher in the management chain don't want to see gory details and explanations of the data - give them the highlights, but have backup information at the ready.

There are times when you need to present the gory details. Perhaps you need to analyze the system utilization, present some details to the sysadmin team, or management wants to jump into the details so they get a better understanding.

Gory Detailed Statistics

This is where you can show off by watching how users are utilizing the system and creating statistics that capture their behavior. You've got lots of data, so let your imagination loose. Examples include the following (a sketch covering a couple of these items follows the list):

  • An ordered list of which users used the most core hours
  • An ordered list of which users submitted the most jobs
  • An ordered list of the top applications based on core-hours
  • Most popular environment modules (article)
  • An ordered list of users based on storage usage (who has the most data)
  • Which users have the lowest utilization in terms of core-hours. This can be measured by the core utilization (e.g. they asked for 16 cores but only used 1 core - that is 1/16 utilization).
  • Most popular time of the day for submitting jobs; most popular day of the week for submitting jobs
  • Time of the day for the largest queue backlog (if any)
  • Number of available nodes over time (this lets you show node downtime for fixes or upgrades)
  • Amount of time waiting in the job queue (as a function of time)
  • Network utilization over time
  • Memory issues over time (counting single bit error corrections)
  • Node temperatures over time and as a function of rack position (this can help identify cooling issues).
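
Here is the sketch promised above, covering two of the items: an ordered list of users by core-hours and the per-user core utilization ratio. The records structure is hypothetical; you would build it by joining scheduler accounting data with your harvested node metrics:

```python
from collections import defaultdict

# Hypothetical per-job records joined from scheduler accounting and
# node metrics: cores_used is the mean number of busy cores.
records = [
    {"user": "alice", "cores_requested": 16, "cores_used": 15.2, "runtime_hours": 10.0},
    {"user": "bob",   "cores_requested": 16, "cores_used": 1.1,  "runtime_hours": 40.0},
]

requested = defaultdict(float)
used = defaultdict(float)
for r in records:
    requested[r["user"]] += r["cores_requested"] * r["runtime_hours"]
    used[r["user"]] += r["cores_used"] * r["runtime_hours"]

# Ordered list of users by requested core-hours, plus the utilization
# ratio (e.g., 1 busy core of 16 requested is 1/16 utilization).
for user in sorted(requested, key=requested.get, reverse=True):
    print(f"{user:10s} {requested[user]:10.1f} core-hours  "
          f"utilization {used[user] / requested[user]:.2f}")
```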

And the list can go on. From statistics such as these you can gain some interesting insight into how the system is being used.

With this information, you are in a position to help users. I tend to think toward the "carrot" approach with users, posing questions such as:

  • How can I help users take better advantage of the system?
  • Is there a better time or day to submit jobs to get through the queue quickly?
  • What are the worst times and days to submit jobs?

With a little more work, you can begin to think about how to make the system easier to use. Overall, you are not punishing users for utilizing the system; rather, you are looking for clues as to why the system isn't being better utilized or why users behave the way they do.

Some of the information can also be used to focus on individual users - in particular, to spot users who are not quite behaving the way they should. For example, if a user has a particularly large amount of data stored, it might be good to sit down with them and ask why. You can also use the opportunity to help them find a better way to store data; perhaps they are storing everything as plain text rather than as binary data, as in the sketch below.
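
To see why that conversation is worth having, here is a small sketch comparing the same array stored as plain text and as raw binary with NumPy (the file names are arbitrary):

```python
import os
import numpy as np

data = np.random.rand(1_000_000)      # one million double-precision values

np.savetxt("data.txt", data)          # plain text: roughly 25 bytes per value
np.save("data.npy", data)             # raw binary: 8 bytes per value

print(os.path.getsize("data.txt"))    # ~25 MB
print(os.path.getsize("data.npy"))    # ~8 MB
```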

The same is true for CPU utilization. If a particular user asks for 16 cores or a complete node (exclusive access), yet their application load is low, it might be a sign that they are having application trouble. It would be good to talk with them about what they're doing so that you can make things better - running more jobs, running jobs faster, and so on. The data can also help you find users who are asking for lots of cores even though their application is purely serial. You can teach them how to ask for one core or how to determine the amount of memory they need (select the minimum number of cores that meets the memory requirement). A simple check for such jobs is sketched below.
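
That check might look like the following sketch, which reuses the hypothetical per-job records structure from the core-hours example; the 25 percent threshold is an arbitrary choice:

```python
def flag_low_utilization(records, threshold=0.25):
    """Flag jobs whose mean busy-core count is far below the request."""
    flagged = []
    for r in records:
        ratio = r["cores_used"] / r["cores_requested"]
        if ratio < threshold:
            flagged.append((r["user"], r["cores_requested"], ratio))
    return flagged

# Hypothetical records: bob asked for 16 cores but ran a serial code.
records = [
    {"user": "alice", "cores_requested": 16, "cores_used": 15.2},
    {"user": "bob",   "cores_requested": 16, "cores_used": 1.1},
]
for user, cores, ratio in flag_low_utilization(records):
    print(f"{user}: requested {cores} cores, utilization {ratio:.0%}")
```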

This last point can also help you in architecting or re-architecting the system. If a large number of users are only utilizing a single core, yet resources are scheduled on a "per node" basis, perhaps you want to switch to a "per core" basis.

You might even want to consider creating an on-premise cloud for these users so they can spin up an instance that has the needed amount of cores and memory. Then the users can get exactly what they want. However, you will likely have to help the users determine how many cores and how much memory they need.

Speaking of re-architecting, examining the sorted list of application usage can help you determine which applications to focus on for the next system, or perhaps how to better educate users on efficient use of the current one. For example, assume that an application like NAMD is by far the most popular application. Knowing this, you could spend some time optimizing the NAMD build to improve performance. Because it's the most used application, even small improvements in performance will help the overall system utilization.

Another example, again assuming NAMD, concerns which system resources are used. You may have some GPUs in your system, yet your users are not utilizing them for their NAMD jobs. A little education for the users, and all of a sudden the run time for NAMD drops significantly because they start using the GPUs.

Conversely, if you don't have GPUs and the top applications can take advantage of them, then you might want to consider adding GPU-enabled nodes, either to the current system or to a new one. You have the data and the analysis that support using GPUs, and you could even estimate the utilization of the GPUs.

I hope you're seeing the usefulness of having all of this data available because you're a good lumberjack. But being a lumberjack is not everything - you need to be a vegan as well and think on multiple levels and in multiple dimensions. Sysadmins who can do this are simply amazing - both management and the users love them.

Summary

These last two articles have been a little esoteric in that very little code has been listed. However, the concepts they contain are extremely important. Understanding your customers (the users and management) as well as yourself is the key to being the best possible admin. After that, thinking like a lumberjack to gather data and then thinking like a vegan to analyze and interpret the data are the steps you need to take.

Tags: HPC