Even with tons of cores per node today, the traditional sets of tools are still serial-only, utilizing a single core; however, some of the more popular tools have parallel versions, allowing you either to use the extra cores to run the same command in parallel or to perform the same task across multiple nodes.

Parallel Tools

When *nix was born – and for a very long time afterward – systems only had a single core. The advent of “clusters” of nodes started people thinking about how to issue the same command to all nodes at the same time – in essence, running the command in parallel.

Then the world became more complicated, with more than one core per socket: first with dual-core processors (two cores per socket), then with quad-core, hexa-core, octa-core, and even 16-core processors per socket. Still, the tools used by administrators or users were fundamentally serial.

A third development, along with clusters and multicore, was the growth and success of parallel filesystems. These storage systems have the ability not only to stripe data across multiple disks to improve the underlying disk performance but also to stripe data across multiple storage nodes in the parallel filesystem, further improving performance. In essence, that gives you multiple disks for striping performance, multiple nodes for further striping performance, and multiple network connections (because of multiple nodes), all of which can greatly improve storage performance.

However, if I try to Gzip a file on a node with 64 cores, the process still only uses a single core. Sixty-three cores are sitting there idle until the Gzip finishes. Moreover, using a single core to Gzip a file on a 2PB Lustre system that is capable of 20GBps is like draining a swimming pool through a straw.

The same is true for a system administrator. If I want to Rsync files from one node to another, the process is inherently serial. Why can’t I invoke rsync so that is uses as many cores as it takes to saturate either the network or the storage? If I have a 20GBps filesystem but I’m doing serial Rsync operations, I’m not taking advantage of the underlying storage performance.

Many other people have also wondered what to do with all of those extra cores and parallel data storage, resulting in some parallel versions of classic tools that can now use multiple cores – or at least divide the work across multiple cores.

In this article, I want to discuss tools that have been parallelized, rather than using something like xargs or GNU parallel with existing serial tools, because, typically, using xargs or parallel means you have to write a script or a series of commands for your specific function rather than using a tool that is generally applicable. I’m sure you can write a generic script with GNU parallel, but you will have to make some assumptions that will limit the universality of the script.

However, if you are curious about using GNU parallel for serial commands so that all cores are utilized, the ADMIN HPC website has a good article for getting starting. The Super User website also has a pretty good discussion on using xargs or GNU parallel for parallel copies.

Copy and Checksum Tools

Copy and checksum tools are commonly used by administrators and users, but they are both single-threaded tools and unable to take advantage of multiple cores, multiple nodes, and parallel storage beyond what a single thread can achieve. One might be inclined to think that copy commands don’t take that long to complete, so saving a second or two might be the most you could save. However, when you copy directories that have thousands of files or when you copy a terabyte-sized file, the normal copy takes much longer than a few seconds. Moreover, some application pipelines use cp as part of a set of commands, so improving copy performance can have a noticeable effect on overall performance.

The first tool I ran across for parallel copy is mcp. It is part of a larger package called Mutil. Rather than start from scratch and write a parallel cp-like tool, the authors developed patches against the Linux cp and md5sum tools that come with coreutils, which is part of every Linux distribution I can think of. The patches allow the tools to be recompiled using OpenMP, so they can run on a single node utilizing as many cores as desired. However, they also provide multinode support (see below). The latest version of the tools patches a specific version of coreutils, coreutils-7.6, which is a little old; however, maybe enough interest will get them to update the coreutils version.

To use Mutil, you have to build cp and md5sum yourself (no binaries available). The process isn’t too difficult, but the following prerequisites are necessary:

  • GNU TLS library (optional but required for multinode TCP support). Tested with gnutls 2.8.6.
  • Libgcrypt library (optional but required for hashing support). Note that msum is not functional without this, but mcp is still usable. Tested with libgcrypt 1.4.5.
  • Libgpg-error library (optional but required for hashing support). Note that you must configure with --enable-static to ensure that the static version of this library exists. Tested with libgpg-error 1.7.
  • MPI library (optional but required for multinode MPI support). Tested with SGI Message-Passing Toolkit 1.25/1.26 but presumably any MPI library should work.

Because these tools are based on the base tools in coreutils, they are drop-in replacements for the coreutil cp and md5sum tools. A few additional options for both tools help tune them for the specific use case, so be sure to read the manual and the man pages.

The examples for using the tools are pretty simple for the base case but get a little more interesting if you use MPI and multinode capabilities. A simple example is:

% mcp -r /home/laytonjb/data1 /data/laytonjb/project1/data1

This example does a recursive copy from one directory to another. The number of cores is determined by an environment variable, sometimes named OMP_NUM_THREADS. Note that mcp is intended to be used for filesystems that are mounted on a specific node (mounted locally).

Mcp also has MPI capability for running on multiple nodes. An example of this is:

% mpiexec -np 16 mcp --mpi -r /home/laytonjb/data1 /data/laytonjb/project1/data1

This command is less like cp and more like an MPI application. The particular command uses 16 cores on the node where the command originates, then it does the multicore copy mcp as before, but with the --mpi option.

Some other examples are floating around that parallelize rsync. The basic concept is to break up the list of files to be processed into different groups and run multiple instances of the application against those groups (one instance per group). One example using Rsync in this manner is a shell script that does exactly this. A Ruby version also does something very similar.

Compression Tools

A data compression tool also is commonly used. Typically it’s used to compress files that aren’t used often or to reduce their size for transfer over the network, such as pre-staging the data to high-speed storage before the application is executed. Many times, it’s used in conjunction with the tar command to collect all of the files in a subdirectory into a single file and then compress it.

A number of parallel compression tools are based on many of the common compression tools gzip, bzip2, xz, and 7zip. Arguably the most common of these tools is gzip. A parallel version of gzip is called pigz. Pigz is a reimplementation of Gzip using the zlib library and pthreads. Using the Pthreads library means that Pigz is capable of using all of the cores on a node.

Pigz breaks a file up into 128KB chunks that are compressed in parallel along with a checksum. The compressed data then is written out in order, and a final check sum is made from the individual checks. The chunk size is controllable via a command-line option, as is the number of threads. Pigz doesn’t parallelize the decompression of a Gzip file, but it does use separate threads for reading, writing, and checksum computations. This can help improve performance, but not necessarily a great deal. John Allspaw has written a pretty good article about Pigz.

Using pigz is pretty easy:

% pigz file.tar

This uses the default compression level of 6, but personally, if I’m going to compress a file, I like to go all the way and use -9 as a command-line option (maximum compression). The details of the command-line options for Pigz are online.

Bzip2 has a couple of parallel implementations. The first one, pbzip2, is a threaded (Pthreads) implementation. It uses the bzip2 library, as well as the pthreads library. (You need to have these installed.) The Pbzip2 README explains that it works in a similar manner to Pigz – it breaks up the file into chunks and compresses each chunk in parallel. I can’t tell whether it does a parallel decompression, but I believe it does using the -d option.

Several examples appear on the Pbzip2 web page, but a simple example,

% pbzip2 massivetarball.tar

compresses the TAR file using the default options. It autodetects the number of cores and will use as many as it can, and it uses a default block size of 900KB (controllable). You can specify the number of threads to use and the block size, as in this example,

% pbzip2 -p 8 -b15vk massivetarball.tar

which uses eight threads and a block size of 1500KB (1.5MB).

A second parallel Bzip2 tool is mpibzip2. It is written by the author of Pbzip2 and is similar in its approach: It breaks up the file into chunks and compresses each chunk. However, rather than use threads, as Pbzip2 does, Mpibzip2 uses MPI to distribute the workload, which means you can use multiple nodes and multiple cores for compressing the file.

I haven’t built Mpibzip2, but in looking at the makefile, it appears to need an MPI library (naturally), as well as pthreads and the libbzip2 libraries. I’m not entirely sure how one runs the binaries, but presumably you need to launch it using mpirun or mpiexec – or however your MPI implementation works – although the README indicates that you just run mpibzip2. Caveat emptor.

A third parallel Bzip2 implements is lbzip2. I haven’t tested this tool either, but at first glance it appears to be a threaded Bzip2 implementation. As with Pbzip2 and Mpibzip2, it claims compatibility with Bzip2, which means that if I compress a file with one of the tools, it can be uncompressed by the serial Bzip2 (and vice versa).

Doug Eadline wrote a good article about parallel compression tools, in which he tests both Pigz and Pbzip2. He compares the serial Gzip and Bzip2 to a quad-core run of Pigz and Pbzip2 on a 1.1GB file (the test was done in 2011). He found that he got a 3.5 times speedup using four cores compared with a single core (pretty good scaling). However, don’t always expect that sort of scaling. Parallel computations are all about moving the bottleneck around. In Doug’s case, the algorithm was the bottleneck, and he used tools that could take advantage of all of the cores, which might, however, move the bottleneck to the underlying storage. In Doug’s case, this does not appear to be the case, but you have to pay attention to where the next bottleneck lies.

If Gzip and Bzip2 are not your cup of tea, you have alternatives, such as the 7zip compression tool. The parallel version of the tool is p7zip. I haven’t built this code, but it appears to come with 7zip as well. The documentation is a little weak, but I’m guessing it uses threads for parallelization and achieves compression by breaking the file into chunks and compressing them in parallel.

The last tool I want to mention is called pixz, a parallel implementation of the xz compression tools. Although I haven’t tested it, xz supposedly produces the highest levels of compression, but probably at the cost of time (it takes longer to create) and computational resources (CPU and memory). Pixz follows the same pattern as the other parallel compression tools – it breaks the file into pieces and compresses each piece in parallel.

Before finishing the section on parallel compression tools, I want to point you to a good article that talks about how to make parallel tools the default on your system. This article, really a Q&A, gives a couple of solutions for using the parallel tools as defaults on the system, particularly when using tar; my personal favorite is aliasing the tar command:

alias tar='tar --use-compress-program=pigz'

This just adds the --use-compress-program option to the normal tar command. I like this approach rather than symlinks, because if I needed to go back to the serial version for some reason, I just change the alias in my .bashrc file and source it.

Parallel Tar

Speaking of tar, I looked high and low for some sort of parallel Tar project. The only possibility I found was mtwrite, which is not really parallel tar code, but rather a “helper” library. It intercepts the write calls from the normal tar command and spawns worker threads for those files. This allows unchanged tar commands to run in parallel. I have not tested this code, only read about it, but if you are daring enough, you might take a look.

Parallel Commands/Shells

One of the first things people developed for clusters was a parallel shell or parallel command. The idea is that you type a command once but it is executed on all of the nodes. Although these tools are very common in clusters, they are not very common in the enterprise world. Administrators use them very frequently for many tasks, and believe it or not, sometimes users can take advantage of them for various tasks.

Probably the most ubiquitous parallel shell is pdsh. It is easy to install and use. You can define the hosts or nodes you want to use in a file and then define an environment variable, WCOLL, that points to that file. For example, the file could look like:

% cat /home/laytonjb/WCOLL

Once WCOLL is defined, you can run the same command over all of the nodes defined in WCOLL:

% pdsh uname -r

You can specify additional hosts or exclude hosts on the pdsh command line. One key thing to be aware of is that you need to be able to access other hosts without a password (an entirely different topic). A reasonable place to learn more is at the pdsh page at Google Code.

Other parallel shell tools are PuSSH, a Python wrapper script to execute ssh on a number of hosts; dsh, which is on the older side, with the last update in 2007, although a Python version of dsh, called PyDSH, is fairly recent (2012); and the SSH power tool sshpt, which also is written in Python and is run as you would a typical ssh command (the syntax is pretty close).

The final parallel shell tool I want to mention is actually a “set” of parallel SSH tools. The package, generically called pssh, provides parallel versions of several tools: pssh (parallel ssh), pscp (parallel scp), prsync (parallel rsync), pnuke (parallel nuke), and pslurp (parallel slurp). The tool is a little older – it was updated in 2008 – but it could be an interesting experiment.


I don’t know about you, but I like speed. I’ve got lots of cores on my systems that aren’t being used when I run typical commands. Moreover, some applications are really a series of scripts that use common shell commands, for which I would like to increase the speed. I hope the tools mentioned in this article will act as a starting point for your exploration. In the meantime, please excuse me while I pigz a few TAR files on my desktop.