Updates and Upgrades in HPC
As my introduction to this article, I want to tell you a short story. I was working on my small Warewulf cluster to incorporate running containers. I tried Docker, Apptainer, enroot, and maybe something else. I was having trouble with all of them because of the prerequisites. Several times I installed them, tried something, then uninstalled them, ending up with compute nodes that no longer booted.
I tried removing packages from the container. I tried reinstalling packages, during which the head node and the container were both updated to a new minor version of Rocky. I asked for help on the Warewulf Slack channel and directly asked a friend for help. I had a “crash cart” connected to the compute node to watch it boot. I could not fix it, but I wasn’t too upset because I had the HPC ADMIN articles and a bunch of notes.
I decided to start over with a fresh container, but before I did that, I thought about how I wanted to create the container this time. I could pull the Rocky 8 container and then maybe write a script to add the packages I needed or wanted, even if I added them in stages. I could also write a simple script that created a chroot and build it up in layers. (Although I started this process, my day job hasn’t allowed me to get back to the project for over a month, and counting, so I haven’t been able to build and test the container from the chroot.) I could also write a Dockerfile to build up a container for myself. During these contemplations, I tried updating Rocky 8 on the head node and in the container, which really broke things: absolutely nothing worked on the compute nodes.
At this point I started to think about general steps I could take to update or upgrade an HPC system, which is the subject of this article.
Update vs. Upgrade
I’m sure you remember the scientific method from school, but rather than begin with a hypothesis, I like to start with a problem statement that includes definitions. My Solutions Architect (SA) background has taught me that if you aren’t careful about defining terms and assumptions, miscommunications happen, as I discovered once when presenting to a customer who assumed I was lying and stormed out of the meeting. That episode might not have been caused by a failure to set expectations and definitions, but it was a fun time nonetheless. Before diving into pontifications about updating or upgrading, I want at least to define “update” and “upgrade” to illustrate how they are very different.
If you poll 10 HPC administrators and ask them for their definitions of update and upgrade, you’ll get maybe 16 answers that are all very different. To me, update refers to keeping a distribution with a specific major and minor number current, for example, keeping Rocky 8.6 up to date. As you move up the version ladder, things get a little murkier. I think of moving from one minor version to the next within the same major version (e.g., from Rocky 8.6 to Rocky 8.7) as an update, although some disagree and think of this as an upgrade.
Going from one major distribution to another is what I think of as an upgrade (e.g., from Rocky 8 to Rocky 9). Before diving into a deeper discussion on updates and upgrades, here are a few assumptions I want to mention:
- storage is separate from compute (although for small systems such as mine, this isn’t always possible);
- you have a backup process that you are comfortable with;
- you have practiced restoring from backups by actually doing a restore;
- the critical nodes in a cluster are usually the head node, the login nodes, control nodes for services such as Slurm, and any other node type that takes work to install and you don’t want to repeat; and
- sometimes login nodes are not considered “critical”; that is, they are simple nodes that users log in to and from which they submit their jobs.
Update Within a Minor Version
An example of updating within a minor version is to keep Rocky 8.6 up to date with package updates that fall into the Rocky 8.6 definition. Within a minor version, package updates are almost always compatible with previous packages in that minor version. The goal is to keep everything compatible to minimize change but to incorporate, perhaps, security fixes or add minor features that do not break compatibility.
I say “almost always compatible” because in some cases a package maintainer introduces a minor package update that breaks compatibility. I’m sure the maintainer’s goal was not to do this, but it does happen. If this does happen, you need to be ready (more on that later).
For a variety of reasons I make the following recommendations for updating within a minor version:
- Do not make updates on a production system if you can help it.
- Make sure the test system matches the production system as closely as possible.
- Have a set of tests ready to run after the package updates.
- Document everything you do in as much detail as possible.
- Be ready to revert any package updates through your package manager. Be sure to practice using it so that you understand the idiosyncrasies.
- Have backups on the test system in case you have problems.
- When you are ready to roll out the update on a production system, have a maintenance window during which you can install the updates system-wide and be sure you can revert if you need to.
- Build in some time in case the updates don’t work or the post-update tests fail.
To elaborate on a few of these recommendations: Although you should avoid doing even minor version updates directly on a production system, in some circumstances (e.g., my little cluster) you can’t adhere to this recommendation. Having a set of tests for checking the system after a package update is critical to ensuring there are no problems before rolling out to a production cluster. These tests should comprise basic commands, benchmarks, and user applications. Don’t make the set of tests too extensive or you will drive yourself crazy, but don’t make the set too small either.
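A post-update test set can start as a very small shell harness. The following is a minimal sketch, not a complete test suite: run_check and report are hypothetical helper names, and the example checks in the comments are placeholders that you would replace with your own basic commands, benchmarks, and user applications.

```shell
# Minimal post-update smoke-test harness (a sketch; helper names are hypothetical).

failures=0

# run_check NAME COMMAND [ARGS...]
# Runs COMMAND quietly, prints a timestamped PASS or FAIL line,
# and counts every failure in the "failures" variable.
run_check() {
    check_name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf '%s PASS %s\n' "$(date +%FT%T)" "$check_name"
    else
        printf '%s FAIL %s\n' "$(date +%FT%T)" "$check_name"
        failures=$((failures + 1))
    fi
}

# report: print the failure count; returns non-zero if anything failed.
report() {
    printf 'checks failed: %s\n' "$failures"
    [ "$failures" -eq 0 ]
}

# Example checks (placeholders; adjust to your own cluster):
#   run_check "kernel reports a version" uname -r
#   run_check "Slurm controller answers" scontrol ping
#   run_check "shared home is mounted"   mountpoint -q /home
#   report
```

Because each check prints one line with a timestamp, the output drops straight into your notes, and the return code of report can gate the next step of a rollout.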
Probably the most important thing you should do is document what you are doing. Your .bash_history can help a little, but it’s not enough by itself. A tool that can really help is script, which allows you to record what you are doing. Be sure not to change terminal tabs because script won’t pick up anything in that new tab.
I like script because it is a sequential text file that can be easily searched and can be massively compressed (it’s just text). The following are a couple of tips when using script:
- Periodically run pwd and list the contents of the current directory, perhaps with something like ls -lstar.
- Frequently use the date command, so you know when things happened and their order.
- Type notes and any other information you want to record at the command prompt. When you hit Enter you will get an error, but anything you entered will appear in the script file.
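These tips can be wrapped in a couple of small shell functions. This is only a sketch: stamp and note are hypothetical helper names (not part of script), and the logfile name in the comment is just an example. The note helper is an alternative to typing free text at the prompt that avoids the error message.

```shell
# Start recording before you begin (script is part of util-linux; -a appends):
#   script -a "$HOME/update-$(date +%F).log"
# Type "exit" when you are done to close the recording.

# stamp: record the time, the current directory, and its contents in one go.
stamp() {
    date
    pwd
    ls -lstar
}

# note TEXT...: print a timestamped note so that it lands in the recording
# without producing a "command not found" error.
note() {
    printf '### NOTE [%s] %s\n' "$(date +%FT%T)" "$*"
}

# Example use inside a recorded session:
#   stamp
#   note "about to update the compute node image"
```

The "### NOTE" prefix is arbitrary; its only purpose is to give you something easy to search for in the recorded file later.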
I might be a bit retentive, but I have been bitten in the past, and as the introduction to this article explains, I was bitten again.
Sometimes you need to revert a package update to a previous version. Most package managers allow you to do this, but before you need it, be sure you know the command options and practice reverting a package.
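On Rocky, the package manager is dnf, and its transaction history is one place to practice a revert. The sketch below assumes DNF 4 as shipped with Rocky 8. The run and revert_transaction helpers and the DRY_RUN variable are hypothetical conveniences for rehearsing the steps, not dnf features.

```shell
# run COMMAND [ARGS...]: execute the command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# revert_transaction ID: show what a dnf transaction changed, then undo it.
# dnf asks for confirmation before undoing, so you can still back out.
revert_transaction() {
    if [ -z "${1:-}" ]; then
        echo "usage: revert_transaction TRANSACTION_ID" >&2
        return 2
    fi
    run dnf history info "$1" && run dnf history undo "$1"
}

# Typical practice session on a test node (the ID 42 is only an example):
#   dnf history list                    # find the transaction ID of the update
#   DRY_RUN=1 revert_transaction 42     # rehearse: prints the commands only
#   revert_transaction 42               # actually undo that transaction
```

Rehearsing with DRY_RUN=1 first keeps the exact commands in your recorded notes before anything changes on the node.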
Complementary to reverting a package with a package manager, find out whether you can install an older version of the package to replace it. This is different from reverting, which usually just returns to the previous version, because you are going back two or more versions. You might also need this option if reverting a package doesn’t work.
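With dnf, going back to a specific older version might look like the following sketch. It assumes the older build is still available in a repository you have enabled, which is not guaranteed. The show_versions and pkg_spec helpers, the package name examplepkg, and the version string are all hypothetical.

```shell
# show_versions PKG: list every build of PKG that the enabled repositories offer.
show_versions() {
    dnf --showduplicates list "$1"
}

# pkg_spec NAME VERSION-RELEASE: join the two parts into the
# NAME-VERSION-RELEASE form that dnf accepts as a package spec.
pkg_spec() {
    printf '%s-%s\n' "$1" "$2"
}

# Example (hypothetical package and version):
#   show_versions examplepkg
#   dnf downgrade "$(pkg_spec examplepkg 1.2.3-4.el8)"
```

If the version you need is no longer in any enabled repository, you would have to obtain the RPM another way, so it is worth checking this on the test system before you need it.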