Gathering data on various aspects of your HPC system is a key step toward developing information about the system and one of the first steps toward tuning it for performance and reporting on system use. It can tell you how users are using the system and, at a high level, what they are doing. In this article, I present a method for gathering data on how users are using Environment Modules, such as which modules are being used, how often, and so on.

Gathering Data on Environment Modules

How many people can answer questions about how their HPC system is used by their users? Questions such as:

    • What are the most popular applications used on their system?
    • Which compilers or versions are the most popular?
    • What is the number of nodes/cores for the average job? What is the standard deviation (SD)?
    • What is the average run time for jobs (SD)?
    • What is the node load history for these jobs?
    • When does the resource manager queue have the fewest jobs?
    • What time of the day do most people submit their jobs?
    • Which user has the largest number of files? How many files?
    • How quickly is the data growing?

The answers to these few simple questions can help you understand how a system is functioning. Understanding how the system is being used is an extremely valuable part of any system administrator's job. To use the requisite car analogy, it's like driving a car without ever watching the instruments, checking the oil, looking at the tires, or adding gas. You are just a passenger letting the system manage you.

A few management and monitoring tools in HPC can gather data on the state of the system, but not many. Moreover, very few reporting tools can use this data. Being able to take raw data, create information from it, and then, one hopes, create knowledge from it, resulting in understanding, is a key to having the best-running HPC system possible and being able to explain or show management how it is being utilized.

In this article, I won't create a single tool or tools that can suddenly answer all of these questions; rather, I present a method that allows you to gather data about aspects of the system and then create information from that data. More specifically, I show you how to gather data about Environment Modules and put it into a log that can be parsed and used to create information.

Approach

The approach I'm using is based on the method that Harvard University's Faculty of Arts and Sciences Research Computing Group developed for monitoring Environment Modules usage. They track which modules are loaded, so they can see which ones are popular (they have several thousand possible modules that encompass applications as well as libraries and compilers). Their approach is to use a wrapper script for loading Environment Modules that captures which module the user is loading. That data is loaded into a database that they can then query to create information from the data. A wrapper script is a simple script that executes the desired binary on behalf of the user. Because it's a script, you can do all sorts of things, such as gather data on which module the user is requesting.

I think this concept is great because it's so easy. It's very simple to place a script earlier in $PATH so that the script is found and executed instead of the real command. You can then gather data about the user's command and store it before executing the real command. To limit the user's ability to bypass the wrapper script, you can change the name of the binary being executed to something anonymous, such as gkja. It is unlikely the user knows that gkja really is the "modules" binary.
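To make the idea concrete, here is a minimal sketch of such a wrapper (the log path and the renamed binary gkja are placeholders for illustration; the real script I use appears in Listing 1):

#!/bin/bash
# Hypothetical wrapper: record who asked for what, then run the
# real (renamed) command on the user's behalf.
echo "$(id -ur) $(date +%s) $*" >> /tmp/wrapper.log
gkja "$@"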

Typically to load a module with Environment Modules, you use the command

$ module load compiler/open64-5.0

where the first part is the module command itself, the second part is the module function, and the third part is the module you are loading. To keep things easy, I'll use Bash as the scripting language; the wrapper should capture the following information:

  • What is the module command (i.e., load)?
  • What is the module being loaded or unloaded?
  • When does the module operation occur?
  • Which user is executing the script?
  • What group is the user using?

Although you could gather other information, this is a good start.
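Before writing the script, you can see what the user, group, and time data look like by running the underlying commands interactively (the output shown here is illustrative and will differ on your system):

[laytonjb@test1 ~]$ id -ur
500
[laytonjb@test1 ~]$ id -gr
500
[laytonjb@test1 ~]$ date +"%s"
1345926274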

For the purposes of this article, the data gathered will be put in a simple logfile that can then be parsed and used by other tools (e.g., Python, Perl, Matlab, R, or MySQL). The basic layout of a line of data in the logfile will be

Environment_Modules, [user], [group], [time], [script], [command], [module]

The seven fields are comma separated, which also makes them amenable to spreadsheets (CSV files). The first field signifies an Environment Modules entry because, at some point, you might want to add other types of entries to the log. The user and group entries are self-explanatory, but notice that they are recorded as the numeric UID and GID from /etc/passwd and /etc/group. The time is the number of seconds since 1970-01-01 00:00:00 UTC. The script entry records the name of the wrapper script that was used, because you might want several tools to write information to the log. The command entry is the module command that was used with Environment Modules. Finally, the module entry lists the module, if there is one, that the user specified. If there is no module, the entry is left blank.
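For example, a load of the Open64 compiler module by a user with UID and GID 500 produces a line whose data portion looks like this (full entries, including the syslog timestamp, appear in Listing 5):

Environment_Modules,500,500,1345926291,/opt/cluster_tools/modules,load,compilers/open64/5.0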

Prepping the Environment

Before diving into the wrapper script, you should review the overall features of the data gathering process. Specifically, the requirements for gathering information about Environment Modules using the wrapper script approach are as follows:

  • The script gathers information about Environment Modules usage that is written to a central logfile for all nodes in the cluster (typically an NFS directory that is mounted on all nodes).
  • The user can’t tell the difference between the wrapper script and the real module command.
  • The user cannot see or modify the logfile.
  • The user cannot easily execute the module command, bypassing data gathering.
  • The process should be generic enough that other wrapper scripts or other tools can be added at a later time.

Keeping these requirements in mind, you can now begin to lay the ground work for the wrapper script (i.e., “prepping the environment”).

If you recall, in the first article I wrote about Environment Modules, I said to copy the modules.bash and modules.sh files to the directory /etc/profile.d. However, to limit the ability of an ordinary user to bypass the wrapper script, you need to move these scripts to a different directory, which in this example I choose to be /etc/cluster_tools. First, as root, create this new directory and move the module files there.

[root@test1 ~]# mkdir /etc/cluster_tools
[root@test1 ~]# mv /etc/profile.d/modules.* /etc/cluster_tools
[root@test1 ~]# chmod 755 /etc/cluster_tools/modules*

In the original article, I also said to place the command /etc/profile.d/modules.sh in either /etc/profile, /etc/bashrc, or your own personal .bashrc or .profile file. If you want to gather information about Environment Modules using the approach in this article, you have to "hide" this command as well. However, if you don't want to erase the command, or you can't, you should still be fine, because the scripts /etc/profile.d/modules.sh and /etc/profile.d/modules.bash no longer exist in that path.
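If the sourcing line was added inside a file test (a common pattern), the now missing file is silently skipped; if not, you may see a harmless "No such file or directory" message at login, but the shell should carry on. A sketch of the guarded form, using the path from the original article:

if [ -f /etc/profile.d/modules.sh ]; then
    . /etc/profile.d/modules.sh
fi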

One change needs to be made to /etc/cluster_tools/modules.sh. To make sure users can't bypass the data gathering, you need to change the name of the module function defined in this file. You can pick any name you want, and you can make it rather strange if you like. The following listing shows the before and after versions of the line I used.

Before

module() { eval `/opt/Modules/$MODULE_VERSION/bin/modulecmd sh $*`; }

After

jmjodule() { eval `/opt/Modules/$MODULE_VERSION/bin/modulecmd sh $*`; }

I chose the name jmjodule arbitrarily. You can choose whatever name you like.

To keep things centralized for HPC clusters, I’m going to put the wrapper script in a central location that is shared across the entire cluster. In this case, I will use an NFS-exported directory that I created for my second Warewulf article. In particular, I will create the directory /opt/cluster_tools in which I will store the wrapper script:

[root@test1 ~]# mkdir /opt/cluster_tools

The second shared directory I need to create is for storing the logs. I want the directory to be common to all compute nodes, so no logs are written locally. From the Warewulf series of articles, I will use the directory /opt/logs, and I will keep the module log in the file module in that directory.

[root@test1 ~]# mkdir /opt/logs
[root@test1 ~]# touch /opt/logs/module
[root@test1 ~]# chmod 600 /opt/logs/module

The touch command is optional, but I use it to make sure the file is there; the chmod keeps ordinary users from reading the logfile.
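A quick check (the size and date shown are illustrative) confirms that only root can read or write the file:

[root@test1 ~]# ls -l /opt/logs/module
-rw------- 1 root root 0 Aug 22 09:30 /opt/logs/module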

The last step is to make sure the path to the wrapper script is early in the $PATH variable. This isn’t too difficult to do by either (1) modifying /etc/profile, (2) modifying /etc/bashrc, or (3) modifying $PATH in your own account. For the purposes of demonstrating how to do this, I will modify my .bashrc file with the following line:

export PATH="/opt/cluster_tools:$PATH"

This line could easily be the last line in /etc/bashrc so that any user who uses the Bash shell would have the correct $PATH in their account.
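Once the wrapper script described in the next section is in place, you can confirm that the shell finds it ahead of anything else by asking where the name resolves (the path shown assumes the setup above):

[laytonjb@test1 ~]$ type modules
modules is /opt/cluster_tools/modules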

Wrapper Script

I based the wrapper script on the work done at Harvard, shown in Listing 1.

Listing 1: Wrapper Script

#!/bin/bash

# Invoke the normal "modules"
. /etc/cluster_tools/modules.sh

# Definitions
base="Environment_Modules"
comma=","

# Create UID and GID for user running script
uid=$(id -ur)
gid=$(id -gr)

# This returns time in seconds since 1970-01-01 00:00:00 UTC
time=$(date +"%s")

# Create final string for output to file
final_string=$base$comma$uid$comma$gid$comma$time$comma$0$comma$1$comma$2

# Write to log (need to finish this)

# Actually run module command (use full path to module command)
jmjodule $*
/bin/bash

Briefly, from the top of the script to the bottom, you:

  • Invoke the normal modules via the script /etc/cluster_tools/modules.sh.
  • Define some strings that will be used later.
  • Get the user ID (UID) and group ID (GID) for the user invoking the script.
  • Get the time in seconds since 1970-01-01 00:00:00 UTC.
  • Create the final comma-delimited output string.
  • Write the string to the log (which is covered in the next section).
  • Execute the “real” module command, which was renamed jmjodule.
  • Execute a Bash shell.

Note that no real effort has been made to make this script ultrasecure or to perform any real error checking. The last item needs a small amount of explanation. The wrapper runs as a child process, so the environment changes made by the module command inside the script would be lost when the script exits; launching a new Bash shell at the end makes those changes available to you. The side effect is that every time you execute the wrapper script, you end up one shell deeper, so you will need to type exit that many extra times when you log out.
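One way to see this nesting is to check Bash's $SHLVL variable, which increases with each nested shell. As I understand it, each wrapper invocation adds two levels (one for the script itself and one for the new interactive shell), so the exact numbers below are illustrative:

[laytonjb@test1 ~]$ echo $SHLVL
1
[laytonjb@test1 ~]$ modules list
No Modulefiles Currently Loaded.
[laytonjb@test1 ~]$ echo $SHLVL
3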

I put the script in /opt/cluster_tools/modules and changed the permissions to 755 with the chmod command. Note that I named the wrapper script modules rather than module. I tend to like this better, but you can definitely name the wrapper script module if you like.
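Assuming the script was saved as modules in root's home directory (an assumption for illustration), those two steps look like this:

[root@test1 ~]# cp ~/modules /opt/cluster_tools/modules
[root@test1 ~]# chmod 755 /opt/cluster_tools/modules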

One piece of business is unfinished – getting the data to a central log.

Logger (Not Frogger)

Providing a logging feature that creates a log entry for a user action but does not allow the user to read or touch the file is challenging. Normally, to write to a file, one needs to have the correct file permissions, which also means being able to edit, erase, or modify the logs, and I don’t want this. I want something like the system log, /var/log/messages, where user actions create a log entry but the user cannot access the logfile. The answer to this conundrum is called logger.

Logger allows users to add comments or entries to logfiles, and it’s pretty simple to use:

[laytonjb@test1 ~]$ logger "This is a test"
...
[root@test1 ~]# tail -n 2 /var/log/messages
Aug 22 15:54:47 test1 avahi-daemon[1398]: Invalid query packet.
Aug 22 17:00:02 test1 laytonjb: This is a test

The logger command in this example was run first by a user, but /var/log/messages was read as root. As a user, I was able to add a line to the system log. I can also tell logger which log to use:

[laytonjb@test1 ~]$ logger -p cron.notice "This is a cron test"
...
[root@test1 ~]# tail -n 2 /var/log/cron
Aug 22 17:10:01 test1 CROND[7438]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lock/mrtg/mrtg_l --confcache-file /var/lib/mrtg/mrtg.ok)
Aug 22 17:12:50 test1 laytonjb: This is a cron test

Logger allows you to direct your comments to specific logs, but it won’t allow you to direct them to an arbitrary file.

Logger looks like the tool I need, but I don't want to write to a system log. I want to write to a centrally located and dedicated log that is not a normal logfile and is not located in /var/log. According to a conversation at Stack Exchange, the syslog interface only allows a fixed set of what are called "facilities," which are defined by constants in /usr/include/sys/syslog.h. So, logs like /var/log/messages and /var/log/cron are tied to predefined facilities, and you can't add new ones. However, syslog does make provisions for custom facilities, called local0 through local7. Using one of these "local" facilities, you can define your own private log.

For the following example, I’ll use a Scientific Linux 6.2 system (other distributions may behave differently). For my system, I define the log and its location in the file /etc/rsyslog.conf. The file should have a RULES section that looks like Listing 2.

Listing 2: Defining the Log in /etc/rsyslog.conf

#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!

*.info;mail.none;authpriv.none;cron.none;local0.none    /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog


# Log cron stuff
cron.*                                                  /var/log/cron

# Log Module stuff
local0.*                                                /opt/logs/module

# Everybody gets emergency messages
*.emerg                                                 *

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log

The lines I added or changed were

*.info;mail.none;authpriv.none;cron.none;local0.none    /var/log/messages

and

# Log Module stuff
local0.*                                                /opt/logs/module

Notice that I used local0 as my log facility, and I used the location of /opt/logs/module for my logfile. (Note: local7 is being used as a boot log.) Before I can test it, I need to reboot the system (I tried restarting the syslog service, but it didn’t work).

After a reboot, I can test that the log was defined with the use of logger from the command line.

[laytonjb@test1 ~]$ logger -p local0.notice "This is a test"
[laytonjb@test1 ~]$ more /opt/logs/module
/opt/logs/module: Permission denied
[laytonjb@test1 ~]$ su
Password:
[root@test1 laytonjb]# more /opt/logs/module
Aug 22 09:41:20 test1 laytonjb: This is a test

Notice that a user cannot look at the log or modify it, but root can.
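Moreover, because the file is owned by root with 600 permissions, a user can't append to it or overwrite it directly, either (the error message is from Bash):

[laytonjb@test1 ~]$ echo "bogus entry" >> /opt/logs/module
bash: /opt/logs/module: Permission denied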

Now it's time to finish the wrapper script. The final version of the script I used is in Listing 3.

Listing 3: Final Wrapper Script with Logging

#!/bin/bash

. /etc/cluster_tools/modules.sh

# Definitions
base="Environment_Modules"
comma=","

# Create UID and GID for user running script
uid=$(id -ur)
gid=$(id -gr)

# This returns time in seconds since 1970-01-01 00:00:00 UTC
time=$(date +"%s")

# Create final string for output to file
final_string=$base$comma$uid$comma$gid$comma$time$comma$0$comma$1$comma$2

# Write to log
logger -p local0.notice $final_string

# Actually run module command (use full path to module command)
jmjodule $*
/bin/bash

Now I can put the wrapper script to the test to see if it works. Listing 4 below has some commands and tests that I ran.

Listing 4: Wrapper Script Test

[laytonjb@test1 ~]$ modules avail

---------------------------------- /opt/Modules/versions ----------------------------------
3.2.9

----------------------------- /opt/Modules/3.2.9/modulefiles ------------------------------
compilers/gcc/4.4.6         module-info                 mpi/openmpi/1.6-open64-5.0
compilers/open64/5.0        modules                     null
dot                         mpi/mpich2/1.5b1-gcc-4.4.6  use.own
lib/atlas/3.8.4             mpi/mpich2/1.5b1-open64-5.0
module-cvs                  mpi/openmpi/1.6-gcc-4.4.6
[laytonjb@test1 ~]$ modules list
No Modulefiles Currently Loaded.
[laytonjb@test1 ~]$ modules load compilers/open64/5.0
[laytonjb@test1 ~]$ modules list
Currently Loaded Modulefiles:
  1) compilers/open64/5.0
[laytonjb@test1 ~]$ modules load mpi/openmpi/1.6-open64-5.0
[laytonjb@test1 ~]$ modules list
Currently Loaded Modulefiles:
  1) compilers/open64/5.0         2) mpi/openmpi/1.6-open64-5.0
[laytonjb@test1 ~]$ modules unload mpi/openmpi/1.6-open64-5.0
[laytonjb@test1 ~]$ modules list
Currently Loaded Modulefiles:
  1) compilers/open64/5.0
[laytonjb@test1 ~]$ modules load mpi/openmpi/1.6-open64-5.0
[laytonjb@test1 ~]$ modules list
Currently Loaded Modulefiles:
  1) compilers/open64/5.0         2) mpi/openmpi/1.6-open64-5.0
[laytonjb@test1 ~]$ modules purge
[laytonjb@test1 ~]$ modules list
No Modulefiles Currently Loaded.
[laytonjb@test1 ~]$ module avail
bash: module: command not found

Notice that when I tried to use the normal module command, it couldn’t be found (this was intentional – to make bypassing the data gathering more difficult).

Now, to look at the logfile, I have to be root. Listing 5 has the lines in the logfile that correspond to the commands in Listing 4.

Listing 5: Logfile Entries for Listing 4

Aug 25 16:24:34 test1 laytonjb: Environment_Modules,500,500,1345926274,/opt/cluster_tools/modules,avail,
Aug 25 16:24:47 test1 laytonjb: Environment_Modules,500,500,1345926287,/opt/cluster_tools/modules,list,
Aug 25 16:24:51 test1 laytonjb: Environment_Modules,500,500,1345926291,/opt/cluster_tools/modules,load,compilers/open64/5.0
Aug 25 16:24:53 test1 laytonjb: Environment_Modules,500,500,1345926293,/opt/cluster_tools/modules,list,
Aug 25 16:25:01 test1 laytonjb: Environment_Modules,500,500,1345926301,/opt/cluster_tools/modules,load,mpi/openmpi/1.6-open64-5.0
Aug 25 16:25:03 test1 laytonjb: Environment_Modules,500,500,1345926303,/opt/cluster_tools/modules,list,
Aug 25 16:25:11 test1 laytonjb: Environment_Modules,500,500,1345926311,/opt/cluster_tools/modules,unload,mpi/openmpi/1.6-open64-5.0
Aug 25 16:25:13 test1 laytonjb: Environment_Modules,500,500,1345926313,/opt/cluster_tools/modules,list,
Aug 25 16:25:23 test1 laytonjb: Environment_Modules,500,500,1345926323,/opt/cluster_tools/modules,load,mpi/openmpi/1.6-open64-5.0
Aug 25 16:25:25 test1 laytonjb: Environment_Modules,500,500,1345926325,/opt/cluster_tools/modules,list,
Aug 25 16:25:36 test1 laytonjb: Environment_Modules,500,500,1345926336,/opt/cluster_tools/modules,purge,
Aug 25 16:25:38 test1 laytonjb: Environment_Modules,500,500,1345926338,/opt/cluster_tools/modules,list,

Looking at the logs, you can see that module commands with no specified option or module end in a comma by design. The other commands, such as load or unload, list a module in the last field.
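Even before writing a dedicated parser, a one-line shell pipeline (run as root, because users can't read the log) can start turning these entries into information. For example, counting which modules were loaded most often from the entries in Listing 5:

[root@test1 ~]# awk -F',' '$6 == "load" {print $7}' /opt/logs/module | sort | uniq -c | sort -rn
      2 mpi/openmpi/1.6-open64-5.0
      1 compilers/open64/5.0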

Final Words

It might seem to be a bit of work to develop a process for gathering data on Environment Modules use, but really it’s not very difficult. Using a simple wrapper script and the nice Linux feature logger, I was able to create a process that allowed me to gather all sorts of data about how users are using Environment Modules and put it in a central logfile that users cannot access.

At this point, you could write tools to parse the log and do all sorts of data manipulation (hopefully producing some good information). For example, you can use Python to parse the data. Although I’m not a Python expert, you can perhaps use the Python string method split() to break each line of the logfile into pieces like this:

#!/usr/bin/python

# Example line from the module log
example_string = "Aug 25 16:24:51 test1 laytonjb: Environment_Modules,500,500,1345926291,/opt/cluster_tools/modules,load,compilers/open64/5.0"

# Split on whitespace; the last element holds the comma-separated data
List1 = example_string.split()
print(List1)
n = len(List1)

# Split that last element on commas to get the seven fields
List2 = List1[n - 1].split(",")
print(List2)
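If you save this as, say, parse_example.py (the filename is just for illustration) and run it, you should see the line broken first on whitespace and then into the seven log fields:

[laytonjb@test1 ~]$ python parse_example.py
['Aug', '25', '16:24:51', 'test1', 'laytonjb:', 'Environment_Modules,500,500,1345926291,/opt/cluster_tools/modules,load,compilers/open64/5.0']
['Environment_Modules', '500', '500', '1345926291', '/opt/cluster_tools/modules', 'load', 'compilers/open64/5.0']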

You can do this sort of thing in Perl, Desperate Perl, Matlab, R, or almost any language you want.

As an example of how useful Environment Modules usage information can be, the HPC team at Harvard used their tools to do a quick survey of the modules their users were using and which ones were used the most. The first part of the results shows that they provide their users almost 2,200 modules!

In the second part, they look at which modules were used the most. The information they gathered in the article is really interesting, at least to HPC people.

This Environment Module audit at Harvard was the inspiration for this article, and I want to thank the Harvard team, particularly their fearless leader and Desperate Perl guru, Dr. James Cuff. They are a really innovative team that keeps a huge and diverse user base happy and productive.