Initial Best Practices for Containers

General Practices

Although it will take a few articles to cover the best practices for creating and maintaining containers, I thought I would share some general practices I’ve collected myself, learned from others, and gathered from articles:

  • Create a help section: Being able to query a container for help information, without having to run the container or shell into it, is a great help to users. Singularity has this capability with the %help section of the recipe (see the recipe sketch after this list).
  • Include the Dockerfile or Singularity recipe in the container: Very often, people want to know how you built the container, so they can inspect the container to understand what’s in it, discover who created it and when, or just to learn about containers. I think putting the Dockerfile or Singularity recipe file in the container is a great idea.
  • Include a test section: Including a test section that can be executed to test the application in the container is a great way to be sure that the container is working properly. Ideally, you should be able to run this section without having to “shell” into the container. Singularity currently has this capability with the %test section (also shown in the sketch after this list).
  • Don't put data in a container: I see people violate this important best practice all the time. Putting data in the container can greatly increase its size, making it difficult to work with. Instead, you want the datasets mounted into the container as a volume.
  • Keep to a single application per container: Although perhaps a controversial best practice, containers originally were designed to hold a single application. To run multiple applications, you coordinate several containers. Theoretically, you can put multiple applications in a single container, but you greatly increase the complexity and size of the container.
  • Build the smallest container possible: If you build a container that is as small as possible, it’s easier to store and move around and will contain fewer packages, reducing the container’s attack surface. However, don’t get crazy and eliminate almost everything in a container; it’s a balance between size and usability.
  • Remove any unnecessary tools and files: This best practice is related to the previous item. You should only use the tools, libraries, and packages that are needed in the container, so you can reduce its size and the attack surface.
  • Use multistage builds: Multistage builds have a number of advantages, but one stands out: You can build the application in one stage and then copy only the resulting binaries into a later stage, leaving the build tools out of the final container and reducing its size (see the multistage sketch after this list).
  • Think of containers as ephemeral objects: You should be able to stop, destroy, and rebuild a container with little effort. Once you are finished using the container, you can archive it for reproducibility. During the development phase, though, think of the container as ephemeral. HPCCM can be a great help in this regard.
  • Use official images when possible: Companies and organizations put out “official” containers that have been created, scanned, and updated with security patches, so you should use these containers as a starting point rather than try to build your own. However, be careful of the origin of a container. Just because a container is listed somewhere doesn’t mean it’s safe.
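To make the help, recipe, and test items above concrete, here is a minimal Singularity recipe sketch. The recipe file name (Singularity.app) and the application path are placeholders I am assuming for illustration, not part of any particular project:

Bootstrap: docker
From: ubuntu:18.04

%files
    Singularity.app /opt/Singularity.app

%help
    This container runs myapp.
    Usage: singularity run myapp.simg <input file>

%test
    # Verify the recipe was copied into the container
    test -f /opt/Singularity.app
    # Placeholder: run the real application's sanity check here
    # /usr/local/bin/myapp --version

After building the image (e.g., with sudo singularity build myapp.simg Singularity.app), users can read the help text with singularity help myapp.simg and run the test section with singularity test myapp.simg, all without shelling into the container.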
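More recent versions of Singularity (3.2 and later) also support multistage builds directly in the definition file. The following is a minimal sketch, assuming a trivial C program compiled in a build stage so that the compiler never ends up in the final image:

Bootstrap: docker
From: ubuntu:18.04
Stage: build

%post
    # Install the compiler only in the build stage
    apt-get update && apt-get install -y build-essential
    # Create and build a trivial placeholder application
    printf 'int main(void){ return 0; }\n' > /tmp/myapp.c
    gcc -O2 -o /myapp /tmp/myapp.c

Bootstrap: docker
From: ubuntu:18.04
Stage: final

%files from build
    /myapp /usr/local/bin/myapp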

This sampling of best practices is not a set of hard-and-fast rules; rather, these are guidelines you should consider when you start building and using containers.

In the next section, I talk about what I consider to be the most important guideline when creating containers: datasets in the container.

Container Datasets

When I first started using containers, I quickly learned that if you aren’t careful, the size of a container can grow quite large. One of the first containers I built was almost 10GB because I put all of my data in the container.

In many situations, people believe that containers have to be completely self-contained; that is, the container holds the application and the dataset for the application. The obvious advantage of this approach is that everything needed is in the container. Wherever you run the container, you will have everything you need, and you know you have the desired data.

Depending on the size of the dataset, the container can be very large, which increases the time to transfer the container to a node for execution. Also, the container will be slow to start. Both of these issues, although they don’t affect container execution, can drive users crazy.

Over time, containers also have a tendency to grow as code and libraries are added, which exacerbates all of the issues around container size and execution time and can cause the specfile to grow out of control (HPCCM can help limit this issue).

Also, the dataset included in the container might not be the data you want. Moreover, you are moving around a potentially large container, so maintaining the container and the data within becomes unwieldy.

A best practice is to separate the data from the container that holds the application, so you can control the size of the container; keep the datasets on the host filesystem instead. This practice is perfect while you are developing the code, trying different datasets, or using the code to solve different problems. Granted, separating the data from the application means the container is no longer the sole source of the application and data, but it does make code development and maintaining the application much easier.

An HPC filesystem such as Lustre, BeeGFS, or IBM Spectrum Scale allows you to store very large datasets. If you run the container on a node that mounts one of these filesystems, you can access the data quite easily.
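For example, Singularity can bind mount a directory from the host filesystem into the container at run time; in this sketch, the /lustre path, image name, and application path are assumptions for illustration:

$ singularity exec -B /lustre/project/data:/data myapp.simg /usr/local/bin/myapp /data/input.dat

The application inside the container then sees the dataset under /data, while the image itself stays small.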

Test Data

Nonetheless, people might argue that having some data in the container could prove useful. One good argument for putting datasets in the container, which many users do not take into account, is being able to check that the application produces a correct answer.

A best practice is to include some “small-ish” datasets in the container that will exercise the primary code paths. You should also include the “correct” output when the included data is used. The key is that the datasets should be fairly small, run fairly quickly, and not produce much output, but exercise the important code paths; although this might seem to be a difficult proposition, it greatly enhances the utility of the container.
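In a Singularity recipe, one way to do this is to copy a small input file and its known-good output into the image at build time; the file names and paths below are placeholders I am assuming for illustration:

%files
    test/small_input.dat /opt/test/small_input.dat
    test/reference_output.dat /opt/test/reference_output.dat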

An alternative to including smaller datasets is to write a script that can download the data. Making the data available on the web allows a simple wget command to pull the dataset into the container.
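For instance, the download could happen in the %post section of the recipe when the container is built; the URL and paths here are only placeholders:

%post
    mkdir -p /opt/test
    wget -O /opt/test/small_input.dat https://example.com/datasets/small_input.dat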

Singularity allows you to define a test section in the specfile with %test that is executed after the container build process is complete. However, you can also execute it at any time, which makes it a great way to test an application with data. To execute the test section of the container, use the command:

$ singularity test image.simg

Inside the test section you can put almost anything you want. For example, you could download data and execute the application or even put in code to compare your output to the “correct” output to make sure you are getting the correct answers. The Singularity documentation has a simple example:

%test
    /usr/local/bin/mpirun --allow-run-as-root /usr/bin/mpi_test

This simple test runs the mpi_test application, which you could create in a script, put in the container, and execute in the test section. The possibilities for testing are many.
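Going a step further, a %test section can run the application on the small dataset included in the container and compare the result against the stored reference output. This is only a sketch that assumes the application and the test files from the earlier examples exist at the paths shown:

%test
    # Run the application on the small test dataset (paths are assumptions)
    /usr/local/bin/myapp /opt/test/small_input.dat > /tmp/test_output.dat
    # Compare against the known-good output shipped in the container
    if diff -q /tmp/test_output.dat /opt/test/reference_output.dat; then
        echo "Test passed: output matches the reference"
    else
        echo "Test failed: output differs from the reference"
        exit 1
    fi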

Summary

Containers are an up-and-coming technology for HPC that is developing very quickly. Many sites are already using them to help users and code developers. Two strengths of containers are portability and reproducibility. As with any software technology, people have developed best practices for containers.

This article is the first in a series that will present and discuss the current best practices for containers. Many of these practices are logical, and some (e.g., not including datasets in a container) may seem limiting, but they always end up helping container creators and users.