Managing Cluster Software Packages
Setting up and configuring an HPC cluster is not as difficult as it used to be; some nice provision tools allow almost anyone to get a cluster working in short order. One issue worth considering is how easy it is to change things once the cluster is working. For example, if you get a cluster set up and then a user comes to you and says, “I need package XYZ built with library EFG version 1.23,” do you re-provision things to meet your user’s needs, or is there an easy way to add and subtract software from a running cluster that is minimally intrusive?
The short answer is “yes.” Before I jump into how one can organize a cluster to be more malleable, some mention of provisioning packages will be helpful. Three basic methods are offered by various toolsets:
- Image Based – A node disk image is propagated out to nodes on boot. Different “rolls” (images) can be constructed for different packages. An example is Rocks Clusters.
- NFS Root – Each node boots and installs everything as NFS root except for things that change for each node (e.g., /etc, /var). This system can be run disk-less or disk-full. An example is oneSIS.
- RAM Disk – A RAM disk is created on each node that holds a running system image. The RAM disk system can be created in hybrid mode, wherein some files are available via NFS, and it can run disk-less or disk-full. An example is Warewulf.
A good description of Warewulf can be found in the HPC Admin series on Warewulf.
Regardless of the provisioning system, the goal is to make changes without having to reboot nodes. Not all changes can be made without rebooting nodes (e.g., changing the underlying provisioning); however, many application packages can be added or removed without too much trouble if some simple steps are taken.
Dump It into /opt
On almost all HPC clusters, users have a globally shared /home, and a globally shared /opt path is possible as well. NFS is used on small to medium-sized clusters to share these directories. On larger clusters, some type of parallel filesystem might be needed. In either case, a mechanism always exists to share files across the cluster.
The simplest method is to install packages in /opt. This approach has the advantage of “install once available everywhere,” although you might have to address some issues with logfiles; however, in general, this method will work with most software applications.
The main issue administrators must deal with is dynamic library linking. Because packages are not installed in the standard /usr/lib path and you don’t want to copy package entries into /etc/ld.so.conf.d/ on the nodes, you need a way to manage the location of the libraries. Of course, doing full static linking is one possibility, and using LD_LIBRARY_PATH is another, but both of these solutions put some extra requirements on users, and ultimately it comes back to the sys admin to support problems with these approaches. The preferred method is to install packages that “just work.”
The solution is very simple. First, create /opt/etc/ld.so.conf.d/ and have all the packages place their library paths in conf files, just as they would in /etc/ld.so.conf.d/. Next, make a small addition to /etc/ld.so.conf on all nodes (i.e., it needs to be part of the node provisioning step so it is there after the node boots). The additional line is:
include /opt/etc/ld.so.conf.d/*.conf
The new line tells ldconfig to search /opt/etc/ld.so.conf.d/ for additional library paths.
If a package is added or removed, all that needs to happen is a global ldconfig on all the nodes to update the library paths. This step is easily accomplished with a tool like pdsh. Thus, installing a package globally on the cluster is as simple as installing it in /opt, making an entry in /opt/etc/ld.so.conf.d/, and running a global ldconfig.
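The sequence just described can be sketched as a few small shell functions. This is only an illustration: the function names, the OPT_LD_CONF_DIR and DRY_RUN variables, the petsc package name, its library path, and the node list n[0-3] are assumptions made for the example. The pdsh command is printed rather than executed unless DRY_RUN=0 is set.

```shell
# Sketch only: names, paths, and the node list are illustrative assumptions.

# Directory picked up through the include line in /etc/ld.so.conf.
OPT_LD_CONF_DIR="${OPT_LD_CONF_DIR:-/opt/etc/ld.so.conf.d}"

# register_libdir NAME LIBDIR
# Writes LIBDIR into $OPT_LD_CONF_DIR/NAME.conf (one library path per line).
register_libdir() {
    local name="$1" libdir="$2"
    mkdir -p "$OPT_LD_CONF_DIR" || return 1
    printf '%s\n' "$libdir" > "$OPT_LD_CONF_DIR/$name.conf"
}

# unregister_libdir NAME
# Removes the conf file when the package is taken out of /opt.
unregister_libdir() {
    rm -f "$OPT_LD_CONF_DIR/$1.conf"
}

# refresh_nodes NODELIST
# Runs ldconfig on every node in NODELIST with pdsh.
# With DRY_RUN unset or set to 1, the command is only printed.
refresh_nodes() {
    local nodes="$1"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "pdsh -w $nodes ldconfig"
    else
        pdsh -w "$nodes" ldconfig
    fi
}

# Example (hypothetical package and nodes):
#   register_libdir petsc /opt/petsc/lib
#   DRY_RUN=0 refresh_nodes 'n[0-3]'
```

Keeping the conf directory in a variable makes it possible to try the functions against a scratch directory before pointing them at the shared /opt.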
If, for instance, you have the current version of Open MPI installed and a user wanted to try the PETSc libraries with a new version, you could easily build and install everything in /opt and have the user running new code without rebooting nodes or having to instruct them on the nuances of LD_LIBRARY_PATH.
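Before handing a newly built binary to a user, it can be worth confirming that every shared library resolves on a node. A minimal sketch using ldd follows. The function names are hypothetical, and the filter assumes the usual glibc ldd output format, in which an unresolved library is reported as "name => not found".

```shell
# missing_libs: read ldd-style output on stdin and print the names
# of libraries that the dynamic linker could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# check_binary BINARY
# Returns non-zero and lists the unresolved libraries if any are missing.
check_binary() {
    local missing
    missing=$(ldd "$1" 2>/dev/null | missing_libs)
    if [ -n "$missing" ]; then
        echo "unresolved libraries for $1:" >&2
        echo "$missing" >&2
        return 1
    fi
}

# Example (hypothetical path), run on a compute node after the global ldconfig:
#   check_binary /opt/petsc/bin/some_solver
```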
Now that you have a way to add and subtract packages easily from your cluster, you need to tell users how to use them.
Global Environment Modules
In a previous article, I described the Environment Modules package. (I have recently noted some other Admin HPC authors have covered the same topic, as well.) The use of Environment Modules provides easy management of various versions and packages in a dynamic HPC environment. One of the issues, however, is how to keep your Modules environment when you use other nodes. If you use ssh to log in to nodes, then you have an easy way to keep (or not keep) your module environment.
With some configuration, the SSH protocol allows passing of environment variables. Additionally, Modules stores the currently loaded modules in an environment variable called LOADEDMODULES. For example, if I load two modules (fftw and mpich2) and then look at my environment, I will find:
LOADEDMODULES=fftw/3.3.2/gnu4:mpich2/1.4.1p1/gnu4
At this point, all I need to do is include this with all cluster SSH sessions, and then I can reload the Module environment. To pass an environment variable via ssh, both the /etc/ssh/ssh_config and /etc/ssh/sshd_config files need to be changed.
First, the client-side /etc/ssh/ssh_config file needs the following line added to it:
SendEnv LOADEDMODULES NOMODULES
(NOMODULES will be explained later.) Keep in mind you can use the Host option in the ssh_config file to restrict the hosts that receive these variables. Similarly, the server-side /etc/ssh/sshd_config file on each node needs the following line added:
AcceptEnv LOADEDMODULES NOMODULES
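As an illustration of the Host restriction, the client-side SendEnv setting can be limited to compute nodes. The sketch below appends such a stanza to a file. The n* host pattern, the function name, and the output path are assumptions for the example, and the result would still need to be merged into the real client configuration.

```shell
# write_sendenv_stanza FILE [PATTERN]
# Appends a Host stanza that forwards the two variables only to hosts
# matching PATTERN (default: n*, a hypothetical node naming scheme).
write_sendenv_stanza() {
    local file="$1" pattern="${2:-n*}"
    cat >> "$file" <<EOF
Host $pattern
    SendEnv LOADEDMODULES NOMODULES
EOF
}

# Example (hypothetical path):
#   write_sendenv_stanza /tmp/ssh_config.cluster 'n*'
# A one-off check from the command line, assuming node n0 exists and
# its sshd accepts the variable:
#   ssh -o SendEnv=LOADEDMODULES n0 'echo $LOADEDMODULES'
```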
Once the SSHD service is restarted, future SSH sessions will transmit the two variables to remote SSH logins. Before the remote login can use modules, they must be loaded. This step can be done by adding a small piece of code to the user’s .bashrc script:
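if [ -z "$NOMODULES" ] ; then
    # Turn the colon-separated list into a space-separated one
    LOADED=`echo -n "$LOADEDMODULES" | sed 's/:/ /g'`
    for I in $LOADED
    do
        if [ "$I" != "" ] ; then
            module load "$I"
        fi
    done
else
    # NOMODULES is set: start the session with no modules loaded
    export LOADEDMODULES=""
fi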
As can be seen from this code, if NOMODULES is set, LOADEDMODULES is cleared and no modules are loaded. If it is not set, each module listed in LOADEDMODULES is loaded. Also note that this assumes the module package and module files are available to the node. Consider the example below, in which two modules are loaded (fftw and mpich2) before logging in to another node (n0 in this case). On the first login, the modules are loaded on the remote node. On the second login, with NOMODULES set, no modules are loaded:
$ module list
Currently Loaded Modulefiles:
  1) fftw/3.3.2/gnu4   2) mpich2/1.4.1p1/gnu4
$ ssh n0
$ module list
Currently Loaded Modulefiles:
  1) fftw/3.3.2/gnu4   2) mpich2/1.4.1p1/gnu4
$ exit
$ export NOMODULES=1
$ ssh n0
$ module list
No Modulefiles Currently Loaded.
As was noted, one of the important assumptions is the availability of module files to all the nodes. By placing the module files in NFS-shared /opt, all the nodes can find the module files in one place, and they can be added or removed without changing the running image on the node.
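One way to make the shared module files visible on every node is to add their directory to MODULEPATH, the search path Environment Modules consults (the module use subcommand does the same thing interactively). The sketch below assumes a hypothetical /opt/modulefiles directory and an illustrative helper name. It prepends the directory only if it is not already listed.

```shell
# add_modulepath DIR
# Prepend DIR to MODULEPATH unless it is already one of the entries.
add_modulepath() {
    local dir="$1"
    case ":${MODULEPATH:-}:" in
        *":$dir:"*) ;;   # already present, leave MODULEPATH unchanged
        *) MODULEPATH="$dir${MODULEPATH:+:$MODULEPATH}" ;;
    esac
    export MODULEPATH
}

# Example for a shell startup file (the directory name is an assumption):
#   add_modulepath /opt/modulefiles
```

Because the function is idempotent, it is safe to call from a startup file that may be sourced more than once per session.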
Toward Cluster RPMs
The final ingredient to this recipe is to encapsulate both of the above ideas into package RPMs; that is, an RPM will install a package in /opt, make the entry in /opt/etc/ld.so.conf.d/, and install a module file. In that way, other than a global ldconfig, the entire package could be installed across the cluster in one step. Indeed, if pdsh (or similar) were required as part of the RPM installation process, the global ldconfig could be done by the RPMs (just like a local ldconfig is done by almost all RPMs).
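The scriptlet side of such an RPM might look roughly like the following shell sketch. Everything here is an assumption for illustration: the xyz package name, the variable and function names, and the optional node list. In a spec file, the body of post_install would sit in %post and the body of post_uninstall in %postun, where RPM passes 0 as the first argument on a final erase and 1 or more on an upgrade.

```shell
# Sketch of the logic a cluster RPM's scriptlets might run (all names hypothetical).

PKG_NAME="${PKG_NAME:-xyz}"                     # hypothetical package name
PKG_PREFIX="${PKG_PREFIX:-/opt/$PKG_NAME}"      # install location under shared /opt
CONF_DIR="${CONF_DIR:-/opt/etc/ld.so.conf.d}"   # directory included from /etc/ld.so.conf
NODES="${NODES:-}"                              # e.g. 'n[0-3]'; empty skips the global step

# run_ldconfig: refresh the local cache, then the nodes if a node list
# is set and pdsh is installed. Failures are ignored so the scriptlet
# does not abort the RPM transaction.
run_ldconfig() {
    ldconfig 2>/dev/null || true
    if [ -n "$NODES" ] && command -v pdsh >/dev/null 2>&1; then
        pdsh -w "$NODES" ldconfig || true
    fi
}

# post_install: record the library path and refresh the caches.
post_install() {
    mkdir -p "$CONF_DIR" || return 1
    printf '%s\n' "$PKG_PREFIX/lib" > "$CONF_DIR/$PKG_NAME.conf"
    run_ldconfig
}

# post_uninstall COUNT: clean up only on a final erase (COUNT is 0),
# so an upgrade does not delete the entry the new version just wrote.
post_uninstall() {
    [ "${1:-0}" -eq 0 ] || return 0
    rm -f "$CONF_DIR/$PKG_NAME.conf"
    run_ldconfig
}
```

An alternative design is to ship the conf file and the module file as ordinary packaged files and leave only the ldconfig calls to the scriptlets, in which case RPM removes the files itself on erase.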
Of course, building good RPMs takes some time, but once you have the basic “skeleton,” it is not that difficult to drop it into the configure/make/install steps for various packages. Once you have good cluster RPMs for your applications, installation and de-installation are simple, convenient, and cluster-wide.