What do you do with all of the HPC data you harvested as a lumberjack? You think like a Vegan.

What to Do with System Data: Think Like a Vegan

Ellie Arroway: You found the primer.
S. R. Hadden: Clever girl! Lights. ... Pages and pages of data. Over 63 thousand in all, and on the perimeter of each ...
Ellie Arroway:... alignment symbols, registration marks, but they don’t line up.
S. R. Hadden: They do, if you think like a Vegan. An alien intelligence is going to be more advanced. That means efficiency functioning on multiple levels and in multiple dimensions.
Ellie Arroway: Yes! Of course. Where is the primer?
S. R. Hadden: You'll see. Every three-dimensional page contains a piece of the primer; there it was all the time, staring you in the face. Buried within the message itself, is the key ...

- Contact (1997)

It's not enough to be a lumberjack

Based on the last article, everyone should either be a lumberjack or planning to be one. You should have developed a logging plan and have lots of data for your cluster. Now, we need to become Vegans too (yes, Vegan Lumberjacks).

Vegan sysadmins think differently than ordinary sysadmins. They look at system administration on "multiple levels and in multiple dimensions." Multiple levels means they think from a user's point of view, an operations point of view, and a management or funding point of view (without bucks, there is no Buck Rogers). Multiple dimensions means thinking of the system from a CPU perspective, a memory perspective, a network perspective, a storage perspective, and, perhaps most importantly of all, an application perspective. Multiple levels and multiple dimensions go well beyond the classic definition of system administration, which is why we need to think like Vegans. If we do, we can achieve much more than break-fix work or installing a new version of an application.

You're a crack lumberjack and you have all sorts of data (metrics) on the nodes in the cluster; now what do you do? Simply, but vaguely, put, you want to be able to parse the logs and search the data to create information. It sounds simple, but it's not, and it is probably the most important thing you will do as a sysadmin. The very first thing you want to do is think like a user (use your vegan powers).

The systems exist to run user jobs. Without those jobs, the systems are pointless. Now we, the sysadmins, need to think like a user (think like a vegan). Thinking like a user will make the system better and will provide you with information that describes the state of the system. Some of the key things for users are:

  • The job executes to completion
  • The job executes in an expected amount of time
  • If repeated, the job produces the same answers in about the same amount of time
  • When a job is run, a summary of the resources used is given to the user
  • Applications or updates are quickly installed and made available including security updates
  • Tools for the creation, testing, profiling, and deployment of applications are made available and are up to date
  • There is enough space for retaining data

So we're thinking like a vegan in what we want, but to get there we need to think like sysadmins (thinking on multiple levels). This means that whatever tools or policies we create and implement should be designed from the perspective of the administrator. One clear rule is not to create tools or policies for a single user (unless that user dominates the system). If you do, you end up with a hodgepodge of tools and policies that are difficult to track, update, and use.

We also need to think of ways to report on the state of the system as a whole. Management will want a quick summary view of the state of the system. So will funding agencies or the people who have committed the funding for the system (again, we're thinking on a different level). Questions pop up such as:

  • How is the system functioning? (always a generic question)
  • What is the backlog on the system?
  • Any problems or issues with the system?

The life of a system is about 3-5 years, so as sysadmins we always need to be thinking about the next system (the workload never goes away). Having information on system utilization, backlogs, etc., can help us create a justification for the new system.

User Needs

One of the best examples of thinking like a "user vegan" is Remora. This is a great tool that gives users a high-level view of the resources their application used when it ran. It also works with MPI applications. Remora collects several streams of information:

  • Memory usage (CPUs, Xeon Phi, and Nvidia GPUs)
  • CPU utilization
  • I/O usage (Lustre, DVS)
  • NUMA properties
  • Network topology
  • MPI communication statistics
  • Power consumption
  • CPU temperatures
  • Detailed application timing

In addition to this information, users want to see how long their application ran and possibly any error messages from the job scheduler.

Because you are a lumberjack, you have recorded these streams of data. You've also recorded the job scheduler information. Therefore, it's not difficult to mine a database or search for data based on the user's job. Recall that we are thinking like a user, so let's begin with the job number from the scheduler. From this number you know when the job started and ended, as well as which nodes were used for the job. With this information you can gather the recorded metrics for those nodes over the duration of the run.
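The lookup described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the Job and Sample records, their field names, and the idea that your logs are already loaded into memory are all hypothetical, and a real site would more likely query a database or the scheduler's accounting records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical scheduler record for one job."""
    job_id: int
    start: float          # epoch seconds
    end: float            # epoch seconds
    nodes: frozenset      # names of the nodes the job ran on


@dataclass(frozen=True)
class Sample:
    """Hypothetical metric sample harvested from a node."""
    node: str
    timestamp: float      # epoch seconds
    metric: str           # e.g. "cpu_util" (name is an assumption)
    value: float


def samples_for_job(job, samples):
    """Keep only samples taken on the job's nodes during the job's run."""
    return [
        s for s in samples
        if s.node in job.nodes and job.start <= s.timestamp <= job.end
    ]


def summarize(samples):
    """Return {metric: (min, mean, max)} for the given samples."""
    by_metric = {}
    for s in samples:
        by_metric.setdefault(s.metric, []).append(s.value)
    return {
        metric: (min(values), sum(values) / len(values), max(values))
        for metric, values in by_metric.items()
    }
```

Calling summarize(samples_for_job(job, samples)) gives a small per-job report, one (min, mean, max) tuple per metric, which is roughly the kind of summary a user would ask for.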

Having this information handy is great when a user has a question about their job. With a job number, you can pull the report and get a quick look at the behavior of the user's application from the perspective of the user (pretty snazzy). Now we're thinking like a "user vegan".

We can also think like a "sysadmin vegan" and use all of this data to capture a glimpse of what users are doing. You can compute some simple statistics, such as the average, minimum, maximum, range, and standard deviation, for a group of jobs. Perhaps you can do this for each day. Then you can create a histogram of the average job length throughout the year (and the same for the range and the other statistics).
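As a rough sketch of those per-day statistics, the following Python uses only the standard library. The input format (a list of start and end times in epoch seconds), the choice to group jobs by the UTC day on which they started, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timezone


def daily_runtime_stats(jobs):
    """Summarize job run times per day.

    jobs: iterable of (start_epoch, end_epoch) pairs (assumed format).
    Jobs are grouped by the UTC date on which they started.
    """
    by_day = defaultdict(list)
    for start, end in jobs:
        day = datetime.fromtimestamp(start, tz=timezone.utc).date()
        by_day[day].append(end - start)

    stats = {}
    for day, runtimes in sorted(by_day.items()):
        stats[day] = {
            "count": len(runtimes),
            "mean": statistics.fmean(runtimes),
            "min": min(runtimes),
            "max": max(runtimes),
            "range": max(runtimes) - min(runtimes),
            "stdev": statistics.pstdev(runtimes),  # population std. deviation
        }
    return stats


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = defaultdict(int)
    for value in values:
        counts[int(value // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))
```

Feeding the daily means into histogram, for example histogram([s["mean"] for s in stats.values()], 3600) for one-hour bins, gives the histogram of average job length described above.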

Again, thinking like a sysadmin vegan, you could perform a "cluster" analysis of the jobs. Do jobs fall into groups by the number of nodes requested? Are there groups based on application run time and number of nodes? You can probably think of other statistics or metrics that interest you or are more applicable to your situation. The point is that you have tons of data from being a lumberjack, and being a vegan allows you to think like a user and develop statistics around them.
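A full cluster analysis (k-means, for example) is more than a short example can carry, so here is a deliberately simpler stand-in: bin each job by node count and run time, then count how many jobs land in each bin. The bucket boundaries, the labels, and the (nodes, runtime_seconds) input format are all assumptions chosen for illustration; heavily populated bins hint at where the natural groups are.

```python
from collections import Counter

# Assumed run-time bins: (exclusive upper bound in seconds, label).
RUNTIME_EDGES = [
    (3600, "<1h"),
    (4 * 3600, "1-4h"),
    (24 * 3600, "4-24h"),
]


def node_bucket(nodes):
    """Round a node count (>= 1) up to the next power of two."""
    return 1 << (nodes - 1).bit_length()


def runtime_bucket(seconds):
    """Map a run time in seconds to one of the assumed labels."""
    for upper, label in RUNTIME_EDGES:
        if seconds < upper:
            return label
    return ">=24h"


def group_jobs(jobs):
    """Count jobs per (node bucket, run-time bucket).

    jobs: iterable of (nodes, runtime_seconds) pairs (assumed format).
    """
    return Counter(
        (node_bucket(nodes), runtime_bucket(runtime))
        for nodes, runtime in jobs
    )
```

Calling group_jobs(jobs).most_common(5) lists the five most populated bins, which is a quick way to see whether, say, small short jobs dominate the system.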