Virtuous Benchmarks: Using Benchmarks to Your Advantage
From my perspective as a user, customer, developer, and administrator on the one hand and as a vendor on the other, one of the most contentious issues from both sides in the HPC industry has been benchmarks.
As a user and customer, I used benchmarks to get an idea of performance and to compare product metrics, such as performance/price or performance/watt. However, benchmarks require time and effort, both to create them and to interpret the results, delaying the request for proposal while the vendors take the time to produce the benchmarks and thus delaying the introduction of a new system. Moreover, after the system is installed, the requested benchmarks are often re-run to make sure it meets the vendor’s guarantees. This again delays putting the system into production.
On the vendor side, I used benchmarks to improve my understanding of how new systems performed, so I could make good recommendations to customers. They also helped me explain to customers how much work would be needed to port applications to these new systems. To achieve this, standard benchmarks and commercial applications were run on the new systems, and the results were published in a series of articles and blog posts. Additionally, customer-specific benchmarks typically took a great deal of work.
Because of the enormous amount of effort required in this process, both sides – customer and vendor – view benchmarks as a necessary evil. Neither side really wants them; nonetheless, they use them. That said, perhaps I can find a way to use them that isn’t so evil. To begin this quest, I’ll examine the benchmarks typically run when installing a system.
Installation Benchmarks
During installation, the system is reconstructed on the customer site, which includes racking and cabling the hardware and installing or checking the system software. Once the system is up and running, benchmarks are run to determine two things: Are the nodes and network functioning correctly? Is system performance as promised?
In my experience, to accomplish these two goals, you should run a series of benchmarks that start with single-node runs and progress to groups of nodes of various sizes.
Single-Node Runs
I like to start with the individual nodes and then work up, so I begin by running the exact same tests on all of the nodes as close to the same time as possible. The tests should run fairly quickly yet stress various components of the system. For example, they should definitely stress the processor(s) and memory, especially the memory bandwidth. I would recommend running both single-core tests and tests that use all of the cores (e.g., with MPI or OpenMP).
A number of benchmarks are available for you to run. The ones I like are the NAS Parallel Benchmarks (NPB). NPB is a set of benchmarks that cover a wide range of applications, primarily from the CFD (Computational Fluid Dynamics) field. I’ve found they really stress the CPU, memory bandwidth, and network in various ways. NPB comes in OpenMP and MPI versions, and its “classes” allow you to run different data sizes. Plus, the benchmarks are very easy to build and run, and the output is easy to interpret.
The NASA website provides the following details on the NPB benchmarks.
- Five kernel benchmarks:
  - IS – Sort small integers using the bucket sort. Typically uses random memory access.
  - EP – Embarrassingly parallel application. Generates independent Gaussian random variates using the Marsaglia polar method.
  - CG – Estimate the smallest eigenvalue of a large sparse symmetric positive-definite matrix using the inverse iteration with the conjugate gradient method as a subroutine for solving systems of linear equations. Uses irregular memory access and communication.
  - MG – Approximate the solution of a three-dimensional discrete Poisson equation using the V-cycle multigrid method on a sequence of meshes. Exhibits both long- and short-distance communication and is memory intensive.
  - FT – Solve a three-dimensional partial differential equation (PDE) using the fast Fourier transform. Uses a great deal of all-to-all communication.
- Three pseudo-applications:
  - BT – Solves a synthetic system of nonlinear PDEs using a block tri-diagonal solver.
  - SP – Solves a synthetic system of nonlinear PDEs using a scalar penta-diagonal solver.
  - LU – Solves a synthetic system of nonlinear PDEs using symmetric successive over-relaxation (SSOR). Also referred to as a Lower-Upper Gauss–Seidel solver.
These tests have both OpenMP and MPI versions, and a “multizone” version of the pseudo-applications can be run in a hybrid mode (i.e., MPI/OpenMP).
The benchmark classes in Table 1 indicate the size of the problem being examined and correlate with the amount of memory used and the amount of time needed to complete.
Table 1: NPB Benchmark Classes
Class | Test Size | Application |
S | Small | Quick tests |
W | Workstation | From the 1990s |
A, B, C | Standard | 4x size increases going from one class to the next |
D, E, F | Large | ~16x size increases from each of the previous classes |
NPB has had three major releases, each of which went through several versions as bugs were found or improvements were introduced. As of this writing, the latest version is 3.3.1 for both NPB and NPB-MZ (multizone).
Benchmark results are usually expressed in terms of how much (wall clock) time the run takes and in GFLOPS (10^9 floating point operations per second) or MFLOPS (10^6 floating point operations per second). For example, Listing 1 presents the output of the MG benchmark (NPB 3.3.1, GCC compilers, Open MPI, Class C) on a single socket with four physical cores and hyperthreading enabled, for eight logical cores in total.
Listing 1: MG Benchmark Output
NAS Parallel Benchmarks 3.3 -- MG Benchmark

No input file. Using compiled defaults
Size: 512x 512x 512 (class C)
Iterations: 20
Number of processes: 8

Initialization time: 5.245 seconds

iter 1
iter 5
iter 10
iter 15
iter 20

Benchmark completed
VERIFICATION SUCCESSFUL
L2 Norm is 0.5706732285739E-06
Error is 0.1345119360807E-12

MG Benchmark Completed.
Class = C
Size = 512x 512x 512
Iterations = 20
Time in seconds = 35.44
Total processes = 8
Compiled procs = 8
Mop/s total = 4393.44
Mop/s/process = 549.18
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3.1
Compile date = 28 Nov 2014
...
The output says the run took 35.44 seconds and achieved a total of 4,393.44 Mop/s. Because the operation type is floating point, that corresponds to about 4.39 GFLOPS.
For testing (benchmarking), I select a subset of the NPB benchmarks and classes, execute single-node runs (either OpenMP or MPI) on all of the nodes roughly at the same time, and name the output files to match the node name. To collect the output from all of the runs, I use simple Bash or Python scripts.
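As an illustration of the kind of collection script meant here, the following Python sketch pulls the run time, total Mop/s, and verification status out of NPB output files. It is a minimal sketch, not the script actually used: the function names and the `<node>.mg.C.out` file-naming layout are assumptions made for the example, and the field labels are taken from the summary block shown in Listing 1.

```python
import re
from pathlib import Path

# Numeric summary fields as labeled in the NPB output (see Listing 1).
# "Mop/s/process" is deliberately not matched.
FIELD_RE = re.compile(
    r"^\s*(Time in seconds|Mop/s total)\s*=\s*([0-9.Ee+-]+)\s*$",
    re.MULTILINE,
)
VERIFY_RE = re.compile(r"Verification\s*=\s*SUCCESSFUL")


def parse_npb_output(text):
    """Extract run time, total Mop/s, and verification status from one run."""
    fields = {name: float(value) for name, value in FIELD_RE.findall(text)}
    return {
        "seconds": fields.get("Time in seconds"),
        "mops": fields.get("Mop/s total"),
        "verified": VERIFY_RE.search(text) is not None,
    }


def collect_results(results_dir, pattern="*.mg.C.out"):
    """Map node name -> parsed result.

    Assumes (hypothetically) one file per node named <node>.mg.C.out,
    so the node name is everything before the first dot.
    """
    results = {}
    for path in sorted(Path(results_dir).glob(pattern)):
        node = path.name.split(".")[0]
        results[node] = parse_npb_output(path.read_text())
    return results
```

Calling `collect_results("results/")` on a directory of such files would return one dictionary entry per node. A run whose output is truncated or failed verification shows up with `None` values or `verified` set to `False`, which is worth checking before computing any statistics.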
With this data in hand, I first look for performance outliers. To begin, I compute the average (arithmetic mean) and standard deviation of all of the results for each test. If the standard deviation is a significant percentage of the average, I then plot the data on a graph of performance versus node number, which I inspect visually for outliers.
From the plot, I can mark some nodes as outliers that need to be re-tested and possibly triaged. Next, I remove the data of the outlier nodes from the totals and recompute the average and standard deviation, repeating the outlier identification process. At some point, one hopes the standard deviation becomes a small percentage of the average, so I can stop the testing process with a set of good nodes and a set of outlier nodes.
For example, I might start with a target of a standard deviation no larger than 5% of the average performance. (Note that 5% is an example, not a hard and fast number.) If the computed standard deviation is greater than 5% of the average, I plot the results and start marking the nodes that fall farthest from the average as outliers. Next, I recompute the average and standard deviation of the reduced set and repeat until I reach the 5% target.
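The iteration described above can be sketched in a few lines of Python. This is only an approximation of the process: where the text relies on plotting and visual inspection, the sketch substitutes a simple automatic rule (set aside the node farthest from the current average), and the function name, the use of the population standard deviation, and the 5% default are all assumptions for illustration.

```python
from statistics import mean, pstdev


def split_outliers(results, target=0.05):
    """Split nodes into (good, outliers).

    `results` maps node name -> performance (e.g., total Mop/s).
    Repeatedly removes the node farthest from the current average until
    the standard deviation is at most `target` times the average.
    This rule is a stand-in for inspecting a plot by eye.
    """
    good = dict(results)
    outliers = {}
    while len(good) > 2:
        values = list(good.values())
        avg = mean(values)
        if pstdev(values) <= target * avg:
            break  # spread is within the target; stop removing nodes
        worst = max(good, key=lambda node: abs(good[node] - avg))
        outliers[worst] = good.pop(worst)
    return good, outliers


if __name__ == "__main__":
    # Made-up numbers purely to show the call; not measured data.
    sample = {"n1": 100.0, "n2": 101.0, "n3": 99.0, "n4": 100.0, "n5": 50.0}
    good, outliers = split_outliers(sample)
    print("good:", sorted(good))
    print("outliers:", sorted(outliers))
```

A plot is still worth making even with a rule like this, because a cluster of slow nodes (for example, one rack) can shift the average enough that the "farthest" node is actually a healthy one.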
With the set of outlier nodes, I re-run the benchmarks one or two more times to see if the performance changes. If it does not, then I triage the nodes (up to and including replacement).
The final step is probably one of the most critical you can take, and it goes to the heart of this article: Be sure to store the single-node results somewhere you can easily retrieve them. Also, store the source, and even the binaries, along with the information on how you built the code, including software versions.
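One possible way to keep results and build information together is to write a small JSON record per node and benchmark, as in the sketch below. The directory layout, the function name `archive_run`, and the field names are assumptions chosen for the example, not a prescribed format. The build details are passed in by the caller rather than detected automatically.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def archive_run(archive_dir, node, benchmark, npb_class, result, build_info):
    """Write one JSON record per node/benchmark/class and return its path.

    `result` holds the measured values (e.g., seconds and Mop/s);
    `build_info` holds whatever describes the build (NPB version,
    compiler, MPI library, flags). Both are plain dictionaries.
    Note: a re-run with the same node/benchmark/class overwrites the file.
    """
    record = {
        "node": node,
        "benchmark": benchmark,
        "class": npb_class,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "result": result,
        "build": build_info,
    }
    out_dir = Path(archive_dir) / node
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{benchmark}.{npb_class}.json"
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out_path
```

A call might look like `archive_run("baseline", "node001", "mg", "C", {"seconds": 35.44, "mops": 4393.44}, {"npb_version": "3.3.1", "compiler": "GCC", "mpi": "Open MPI"})`, with the compiler and MPI entries extended to include the exact versions and flags used. Storing the source tree and binaries alongside these records keeps everything needed to repeat the run in one place.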