What to Do with System Data: Think Like a Vegan

Gory detailed statistics

This is where you can show off by watching how users use the system and creating statistics that capture their behavior. You have lots of data, so let your imagination loose. Examples include:

  • An ordered list of which users used the most core hours
  • An ordered list of which users submitted the most jobs
  • An ordered list of the top applications based on core-hours
  • Most popular environment modules (article)
  • An ordered list of users based on storage usage (who has the most data)
  • Which users have the lowest utilization in terms of core-hours. This can be measured by the core utilization (e.g. they asked for 16 cores but only used 1 core - that is 1/16 utilization).
  • Most popular time of the day for submitting jobs. Most popular day of the week for submitting jobs
  • Time of the day for the largest queue backlog (if any)
  • Number of available nodes over time (this lets you show node down times and fixing or upgrading nodes)
  • Amount of time waiting in the job queue (as a function of time)
  • Network utilization over time
  • Memory issues over time (counting single bit error corrections)
  • Node temperatures over time and as a function of rack position (this can help identify cooling issues).

And the list can go on. From statistics such as these you can gain some interesting insight into how the system is being used.
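As a rough illustration of the first two items in the list, the sketch below ranks users by core-hours and by job count. It assumes the accounting data has already been exported into simple records. The field names (`user`, `cores`, `elapsed_hours`) and the sample values are hypothetical and would need to be mapped from whatever your resource manager actually logs.

```python
from collections import defaultdict

# Hypothetical job records; in practice these would come from your
# resource manager's accounting logs.
SAMPLE_JOBS = [
    {"user": "alice", "cores": 16, "elapsed_hours": 2.0},
    {"user": "bob", "cores": 1, "elapsed_hours": 10.0},
    {"user": "alice", "cores": 4, "elapsed_hours": 1.0},
    {"user": "carol", "cores": 32, "elapsed_hours": 0.5},
]


def rank_users(jobs):
    """Return (by_core_hours, by_job_count), each sorted in descending order."""
    core_hours = defaultdict(float)
    job_counts = defaultdict(int)
    for job in jobs:
        # Core-hours charged to a job: cores requested times wall-clock hours.
        core_hours[job["user"]] += job["cores"] * job["elapsed_hours"]
        job_counts[job["user"]] += 1
    by_core_hours = sorted(core_hours.items(), key=lambda kv: kv[1], reverse=True)
    by_job_count = sorted(job_counts.items(), key=lambda kv: kv[1], reverse=True)
    return by_core_hours, by_job_count


if __name__ == "__main__":
    by_hours, by_count = rank_users(SAMPLE_JOBS)
    for user, hours in by_hours:
        print(f"{user:10s} {hours:8.1f} core-hours")
```

The same grouping pattern should carry over to most of the other ordered lists above by swapping the key (application, module, node) and the quantity being summed.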

With this information you are in a position to help users. I tend to favor the "carrot" approach with users, which leads to questions such as:

  • How can I help users take better advantage of the system?
  • Is there a better time or day to submit jobs to get through the queue quickly?
  • What are the worst times and days to submit jobs?
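One way to start answering the last two questions is to group queue wait times by the hour and the weekday on which jobs were submitted. The sketch below is only an illustration under stated assumptions: the `submit` and `start` fields and the timestamps are made-up sample data, and a real analysis would need far more jobs before the averages mean anything.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical submit/start timestamps in ISO 8601 format.
SAMPLE_JOBS = [
    {"submit": "2024-03-04T09:00:00", "start": "2024-03-04T11:00:00"},
    {"submit": "2024-03-04T09:30:00", "start": "2024-03-04T10:30:00"},
    {"submit": "2024-03-05T22:00:00", "start": "2024-03-05T22:10:00"},
    {"submit": "2024-03-09T22:15:00", "start": "2024-03-09T22:35:00"},
]


def mean_wait_minutes(jobs):
    """Return mean queue wait in minutes, keyed by submit hour and by weekday."""
    by_hour = defaultdict(list)
    by_weekday = defaultdict(list)
    for job in jobs:
        submit = datetime.fromisoformat(job["submit"])
        start = datetime.fromisoformat(job["start"])
        wait = (start - submit).total_seconds() / 60.0
        by_hour[submit.hour].append(wait)
        by_weekday[WEEKDAYS[submit.weekday()]].append(wait)
    hourly = {hour: mean(waits) for hour, waits in by_hour.items()}
    daily = {day: mean(waits) for day, waits in by_weekday.items()}
    return hourly, daily


if __name__ == "__main__":
    hourly, daily = mean_wait_minutes(SAMPLE_JOBS)
    best_hour = min(hourly, key=hourly.get)
    worst_hour = max(hourly, key=hourly.get)
    print(f"Shortest average wait: jobs submitted at {best_hour:02d}:00")
    print(f"Longest average wait:  jobs submitted at {worst_hour:02d}:00")
```

A mean is sensitive to a few very long waits, so a median or a percentile may be a better summary on real data.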

With a little more work you can begin to think about how to make the system easier to use. The point is not to punish users for using the system; rather, you are looking for clues as to why the system isn't being better utilized or why users behave the way they do.

Some of the information can also be used to focus more on individual users. It can be used to spot users who are not quite behaving the way they should. For example, if a user has a particularly large amount of data stored, it might be good to sit down with them and ask why. You can also use the opportunity to help them find a better way to store data. Perhaps they are storing everything as pure text rather than as binary data.

The same is true for CPU utilization. If a particular user is asking for 16 cores or a complete node (exclusive access) yet their application load is low, it might be a sign that they are having application trouble. It would be good to talk to them about what they're doing so that you can make things better - running more jobs, running jobs faster, etc. It can also help you find users who are asking for lots of cores, yet their application is purely serial. You can teach them how to ask for 1 core or determine the amount of memory they need (select the minimum number of cores to meet the required memory).
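The core utilization mentioned above can be estimated as CPU time consumed divided by the core time allocated (cores requested times wall-clock time). The sketch below is a minimal illustration of that ratio. The record fields (`job_id`, `cores`, `cpu_seconds`, `elapsed_seconds`), the sample values, and the 25% threshold are all assumptions chosen for the example, not values taken from any particular scheduler.

```python
# Hypothetical accounting records with made-up values.
SAMPLE_JOBS = [
    {"job_id": 101, "user": "alice", "cores": 16,
     "cpu_seconds": 3600.0, "elapsed_seconds": 3600.0},
    {"job_id": 102, "user": "bob", "cores": 8,
     "cpu_seconds": 27000.0, "elapsed_seconds": 3600.0},
    {"job_id": 103, "user": "carol", "cores": 4,
     "cpu_seconds": 1800.0, "elapsed_seconds": 3600.0},
]


def core_utilization(cpu_seconds, cores_requested, elapsed_seconds):
    """Fraction of the allocated core time that was actually spent computing."""
    allocated = cores_requested * elapsed_seconds
    if allocated <= 0:
        return 0.0
    return cpu_seconds / allocated


def low_utilization_jobs(jobs, threshold=0.25):
    """Return (job_id, utilization) pairs below the threshold, worst first."""
    flagged = []
    for job in jobs:
        util = core_utilization(job["cpu_seconds"], job["cores"],
                                job["elapsed_seconds"])
        if util < threshold:
            flagged.append((job["job_id"], util))
    return sorted(flagged, key=lambda pair: pair[1])


if __name__ == "__main__":
    for job_id, util in low_utilization_jobs(SAMPLE_JOBS):
        print(f"job {job_id}: {util:.1%} of requested cores used")
```

Job 101 in the sample data is the case described earlier: 16 cores requested, one core's worth of CPU time used, so a utilization of 1/16. A low ratio is only a prompt for a conversation, since some jobs legitimately request cores to get the memory they need.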

This last point can also help you in architecting or re-architecting the system. If it seems that a large number of users are only utilizing a single core, yet resources are scheduled on a "per node" basis, perhaps you want to switch that to a "per core" basis.
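To put a number on "a large number of users are only utilizing a single core," you could estimate the average number of busy cores per job (CPU time divided by wall-clock time) and count how many jobs stay at or below roughly one. This is a sketch only; the field names, the sample values, and the `tolerance` of 1.1 busy cores are hypothetical choices for the example.

```python
# Hypothetical job records with made-up values.
SAMPLE_JOBS = [
    {"cores": 16, "cpu_seconds": 3500.0, "elapsed_seconds": 3600.0},
    {"cores": 16, "cpu_seconds": 50000.0, "elapsed_seconds": 3600.0},
    {"cores": 1, "cpu_seconds": 3000.0, "elapsed_seconds": 3600.0},
    {"cores": 8, "cpu_seconds": 7000.0, "elapsed_seconds": 7200.0},
]


def single_core_share(jobs, tolerance=1.1):
    """Fraction of jobs whose average number of busy cores is <= tolerance."""
    counted = 0
    serial = 0
    for job in jobs:
        if job["elapsed_seconds"] <= 0:
            continue  # skip records without a usable run time
        counted += 1
        busy_cores = job["cpu_seconds"] / job["elapsed_seconds"]
        if busy_cores <= tolerance:
            serial += 1
    return serial / counted if counted else 0.0


if __name__ == "__main__":
    share = single_core_share(SAMPLE_JOBS)
    print(f"{share:.0%} of jobs kept about one core busy")
```

If that share turns out to be high on a system scheduled per node, it is one piece of evidence in favor of per-core scheduling.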

You might even want to consider creating an on-premises cloud for these users so they can spin up an instance that has the needed number of cores and amount of memory. Then the users can get exactly what they want. However, you will likely have to help the users determine how many cores and how much memory they need.

Speaking of re-architecting, examining the sorted list of application usage can help you determine which applications to focus on for the next system, or where to better educate users on using the system efficiently. For example, let's assume that an application like NAMD is by far the most popular application. Knowing this, you could spend some time optimizing the NAMD build to improve performance. Since it's the most used application, even small improvements in performance will help the overall system utilization.
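A sorted application list is more persuasive when it also shows each application's share of the total. The sketch below computes core-hours per application and the fraction of all core-hours each one represents. The `app` field, the application names other than NAMD, and all of the numbers are hypothetical sample data.

```python
from collections import Counter

# Hypothetical job records with made-up values.
SAMPLE_JOBS = [
    {"app": "namd", "cores": 64, "elapsed_hours": 10.0},
    {"app": "namd", "cores": 32, "elapsed_hours": 5.0},
    {"app": "gromacs", "cores": 16, "elapsed_hours": 5.0},
    {"app": "python", "cores": 1, "elapsed_hours": 120.0},
]


def application_share(jobs):
    """Return (app, core_hours, share_of_total) tuples, largest first."""
    totals = Counter()
    for job in jobs:
        totals[job["app"]] += job["cores"] * job["elapsed_hours"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(app, hours, hours / grand_total)
            for app, hours in totals.most_common()]


if __name__ == "__main__":
    for app, hours, share in application_share(SAMPLE_JOBS):
        print(f"{app:10s} {hours:8.1f} core-hours  {share:6.1%}")
```

With a share in hand, a back-of-the-envelope estimate becomes possible: if an application accounts for most of the core-hours, a modest speedup of that one build frees up a correspondingly large slice of the machine.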

Another example, again assuming NAMD, is which system resources its jobs actually use. For example, you may have some GPUs in your system, yet your users are not utilizing them for their NAMD jobs. A little education for the users and all of a sudden the run time for NAMD drops significantly because they can start using the GPUs.

Conversely, if you don't have GPUs and the top applications can take advantage of them, then you might want to consider adding GPU-enabled nodes either to the current system or to a new one. You have the data and the analysis that support using GPUs, and you could even estimate the utilization of the GPUs.

I hope you're seeing the usefulness of having all of this data available because you're a good lumberjack. But being a lumberjack is not everything - you need to be a vegan as well and think on multiple levels and in multiple dimensions. Sysadmins who can do this are simply amazing - both management and the users love them.

Summary

These last two articles have been a little esoteric in that very little code has been listed. However, the concepts contained in the articles are extremely important. Understanding your customers (the users and management) as well as yourself is the key to being the best possible admin. After that, thinking like a lumberjack to gather data, and then thinking like a vegan to analyze and interpret the data, are the steps you need to take.