Lead Image © Tom De Spiegelaere, 123RF.com

Using benchmarks to your advantage

Node Check

Article from ADMIN 26/2015
By
A collection of single- and multinode performance benchmarks is an excellent place to start when debugging a user's application that isn't running well.

From my perspective as a user, customer, developer, and administrator on one hand and as a vendor on the other, one of the most contentious issues from both sides in the HPC industry has been benchmarks.

As a user and customer, I used benchmarks to get an idea of performance and to compare product metrics, such as performance/price or performance/watt. However, benchmarks require time and effort, both to create them and to interpret the results. They delay the request for proposal while vendors take the time to produce results, and thus delay the introduction of a new system. Moreover, after the system is installed, the requested benchmarks are often re-run to make sure it meets the vendor's guarantees, which again delays putting the system into production. On the vendor side, I used benchmarks to improve my understanding of how new systems performed, so I could make good recommendations to customers. They also helped me explain to customers how much work would be needed to port applications to these new systems.

To achieve this, standard benchmarks and commercial applications were run on the new systems, and the results were published in a series of articles and blog posts. Any customer-specific benchmarks typically took a great deal of work. Because of the enormous amount of effort required in this process, both sides, customer and vendor, view benchmarks as a necessary evil. Neither side really wants them; nonetheless, they use them. That said, perhaps I can find a way to use them that isn't so evil. To begin this quest, I'll examine the benchmarks typically run when installing a system.

Installation Benchmarks

During installation, the system is reconstructed on the customer site, which includes racking and cabling the hardware and installing or checking the system software. Once the system is up and running, benchmarks are run to determine two things: Are the nodes and network functioning correctly? Is system performance as promised?

In my experience, to accomplish these two goals, you should run a series of benchmarks that start with single-node runs and progress to groups of nodes of various sizes.

Single-Node Runs

I like to start with the individual nodes and then work up, so I begin by running the exact same tests on all of the nodes as close to the same time as possible. The tests should run fairly quickly yet stress various components of the system. For example, they should definitely stress the processor(s) and memory, especially the bandwidth. I would recommend running single-core tests and tests that use all of the cores (i.e., MPI or OpenMP).

A number of benchmarks are available for you to run. The ones I like are the NAS Parallel Benchmarks (NPB) [1]. NPB is a set of benchmarks that cover a wide range of applications, primarily from the CFD (Computational Fluid Dynamics) field. I've found they really stress the CPU, memory bandwidth, and network in various ways. NPB comes in OpenMP and MPI versions, and its "classes" allow you to run different data sizes. Plus, the benchmarks are very easy to build and run, and the output is easy to interpret.

The NASA website [2] provides the following details on the NPB benchmarks.

Five kernel benchmarks:

  • IS – Sorts small integers using the bucket sort. Typically uses random memory access.
  • EP – Embarrassingly parallel application. Generates independent Gaussian random variates using the Marsaglia polar method.
  • CG – Estimates the smallest eigenvalue of a large sparse symmetric positive-definite matrix using the inverse iteration with the conjugate gradient method as a subroutine for solving systems of linear equations. Uses irregular memory access and communication.
  • MG – Approximates the solution of a three-dimensional discrete Poisson equation using the V-cycle multigrid method on a sequence of meshes. Exhibits both long- and short-distance communication and is memory intensive.
  • FT – Solves a three-dimensional partial differential equation (PDE) using the fast Fourier transform. Uses a great deal of all-to-all communication.

Three pseudo-applications:

  • BT – Solves a synthetic system of nonlinear PDEs using a block tri-diagonal solver.
  • SP – Solves a synthetic system of nonlinear PDEs using a scalar penta-diagonal solver.
  • LU – Solves a synthetic system of nonlinear PDEs using symmetric successive over-relaxation (SSOR). Also referred to as a Lower-Upper Gauss-Seidel solver.

These tests have both OpenMP and MPI versions, and a "multizone" version of the pseudo-applications can be run in a hybrid mode (i.e., MPI/OpenMP).

The benchmark classes in Table 1 indicate the size of the problem being examined and correlate with the amount of memory used and the amount of time needed to complete.

Table 1

NPB Benchmark Classes

Class      Test Size      Application
S          Small          Quick tests
W          Workstation    From the 1990s
A, B, C    Standard       4x size increases going from one class to the next
D, E, F    Large          ~16x size increases from each of the previous classes

NPB has had three major releases, each of which went through several versions as bugs were found or improvements were introduced. As of this writing, the latest version is 3.3.1 for both NPB and NPB-MZ (multizone).

Benchmark results are usually expressed in terms of how much (wall clock) time the run takes and in GFLOPS (10^9 floating point operations per second) or MFLOPS (10^6 floating point operations per second). For example, Listing 1 presents the output of the MG benchmark (NPB 3.3.1, GCC compilers, Open MPI, Class C) on a single-socket system with four physical cores plus four hyperthreaded cores, for eight cores in total.

Listing 1

MG Benchmark Output

 NAS Parallel Benchmarks 3.3 -- MG Benchmark
 No input file. Using compiled defaults
 Size:  512x 512x 512  (class C)
 Iterations:   20
 Number of processes:      8
 Initialization time:           5.245 seconds
  iter    1
  iter    5
  iter   10
  iter   15
  iter   20
 Benchmark completed
 VERIFICATION SUCCESSFUL
 L2 Norm is  0.5706732285739E-06
 Error is    0.1345119360807E-12
 MG Benchmark Completed.
 Class           =                        C
 Size            =            512x 512x 512
 Iterations      =                       20
 Time in seconds =                    35.44
 Total processes =                        8
 Compiled procs  =                        8
 Mop/s total     =                  4393.44
 Mop/s/process   =                   549.18
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              28 Nov 2014
...

The output says the run took 35.44 seconds and achieved a total of 4,393.44 MOPS (Mop/s in Listing 1, million operations per second). Because the operation type is floating point, this equals about 4.393 GFLOPS.
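The arithmetic behind those numbers can be checked with a few lines of Python. The following is a minimal sketch that assumes the summary block always uses the "key = value" layout seen in Listing 1; the function name `parse_npb_summary` and the `SAMPLE` excerpt are illustrative choices, not part of NPB.

```python
import re

# Excerpt of the summary block from Listing 1.
SAMPLE = """\
 Class           =                        C
 Time in seconds =                    35.44
 Total processes =                        8
 Mop/s total     =                  4393.44
 Mop/s/process   =                   549.18
 Operation type  =           floating point
"""

def parse_npb_summary(text):
    """Return the 'key = value' lines of an NPB summary as a dict of strings."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*=\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields

summary = parse_npb_summary(SAMPLE)
mops_total = float(summary["Mop/s total"])
processes = int(summary["Total processes"])

# Mop/s only equals MFLOPS when the operation type is floating point.
assert summary["Operation type"] == "floating point"
gflops = mops_total / 1000.0               # 1 GFLOPS = 1,000 MFLOPS
mops_per_process = mops_total / processes  # should match Mop/s/process

print(f"{gflops:.3f} GFLOPS, {mops_per_process:.2f} Mop/s per process")
# prints: 4.393 GFLOPS, 549.18 Mop/s per process
```

The per-process figure is simply the total divided by the process count, so a mismatch between the two lines in a real output file would be a hint that the file was truncated or edited.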

For testing (benchmarking), I select a subset of the NPB benchmarks and classes, execute single-node runs (either OpenMP or MPI) on all of the nodes roughly at the same time, and name the output files to match the node name. To collect the output from all of the runs, I use simple Bash or Python scripts.
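A collection script along these lines might look like the sketch below. It assumes one output file per node named `<node>.mg.C.out` in a single directory; that naming scheme, the function name `collect_results`, and the demo files are hypothetical and should be adapted to whatever convention you use.

```python
import re
import tempfile
from pathlib import Path

MOPS_LINE = re.compile(r"^\s*Mop/s total\s*=\s*([0-9.]+)", re.MULTILINE)
VERIFIED = "VERIFICATION SUCCESSFUL"

def collect_results(result_dir, suffix=".mg.C.out"):
    """Map node name -> Mop/s total for every '<node><suffix>' file.

    Runs that failed verification or lack a Mop/s line map to None so
    they can be flagged rather than silently dropped.
    """
    results = {}
    for path in sorted(Path(result_dir).glob(f"*{suffix}")):
        node = path.name[: -len(suffix)]
        text = path.read_text()
        match = MOPS_LINE.search(text)
        if match and VERIFIED in text:
            results[node] = float(match.group(1))
        else:
            results[node] = None
    return results

# Demo with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "node001.mg.C.out").write_text(
        " VERIFICATION SUCCESSFUL\n"
        " Mop/s total     =                  4393.44\n"
    )
    Path(tmp, "node002.mg.C.out").write_text(" Benchmark did not finish\n")
    results = collect_results(tmp)

print(results)
# prints: {'node001': 4393.44, 'node002': None}
```

Keeping failed runs in the dictionary as `None` is a deliberate choice here: a node that never produced a result is at least as interesting as a slow one.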

With this data in hand, I first look for performance outliers. To begin, I compute the average (arithmetic mean) and standard deviation of all of the results for each test. If the standard deviation is a significant percentage of the average, I then plot the data on a graph of performance versus node number, which I inspect visually for outliers.

From the plot, I can mark some nodes as outliers that need to be re-tested and possibly triaged. Next, I remove the data of the outlier nodes from the totals and recompute the average and standard deviation, repeating the outlier identification process. At some point, one hopes the standard deviation becomes a small percentage of the average, so I can stop the testing process with a set of good nodes and a set of outlier nodes.

For example, I might start with a performance standard deviation target of +/-5 percent of the average. (Note that 5 percent is an example, not a hard and fast number.) If the computed standard deviation is greater than 5 percent, I will plot the results and start choosing nodes outside of this deviation. Next, I recompute the average and standard deviation of the reduced set and repeat until I reach the target 5 percent deviation.
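The loop described above can be approximated in code. The sketch below is an automated stand-in for the visual inspection step, not a replacement for it: it assumes that the node farthest from the current average is the next outlier to remove, uses the sample standard deviation, and runs on made-up Mop/s values for six hypothetical nodes.

```python
from statistics import mean, stdev

def split_outliers(results, target=0.05):
    """Peel off the node farthest from the mean until stdev/mean <= target.

    Returns (good, outliers) as two dicts of node -> result.
    """
    good = dict(results)
    outliers = {}
    while len(good) > 2:
        avg = mean(good.values())
        if stdev(good.values()) / avg <= target:
            break
        worst = max(good, key=lambda node: abs(good[node] - avg))
        outliers[worst] = good.pop(worst)
    return good, outliers

# Hypothetical Mop/s totals, one per node.
results = {
    "node001": 4393.4,
    "node002": 4410.2,
    "node003": 4380.9,
    "node004": 3105.7,
    "node005": 4402.3,
    "node006": 4371.0,
}

good, outliers = split_outliers(results, target=0.05)
print("good:", sorted(good))
print("outliers:", sorted(outliers))
# prints: good: ['node001', 'node002', 'node003', 'node005', 'node006']
# prints: outliers: ['node004']
```

With all six nodes the standard deviation is roughly 12 percent of the average; after node004 is removed it drops well below 1 percent, so the loop stops after a single pass. The 5 percent target is a parameter for the same reason it is only an example in the text.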

With the set of outlier nodes, I re-run the benchmarks one or two more times to see if the performance changes. If it does not, then I triage the nodes (up to and including replacement).

The last step is probably one of the most critical steps you can take, and it goes to the heart of this article. Be sure to store the single-node results somewhere you can easily retrieve them. Also, store the source, and even the binaries, along with information on how you built the code, including software versions.
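One way to keep results and build information together is a small JSON record per benchmark run. The following sketch is only an illustration: the field names and the `make_record` helper are assumptions, and the placeholder strings in `build_info` stand for details you would fill in from your own build (for NPB, the compiler and flag settings live in `config/make.def`).

```python
import json
import platform
from datetime import datetime, timezone

def make_record(benchmark, npb_class, results, build_info):
    """Bundle per-node results with the details needed to rebuild the code."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "recorded_on": platform.node(),
        "benchmark": benchmark,
        "class": npb_class,
        "results_mops": results,
        "build": build_info,
    }

record = make_record(
    "MG",
    "C",
    {"node001": 4393.44},
    {
        "npb_version": "3.3.1",
        "compiler": "GCC (record the exact version)",
        "mpi": "Open MPI (record the exact version)",
        "flags": "copy from config/make.def",
    },
)

# Sorted keys and indentation keep the file readable and diff-friendly.
text = json.dumps(record, indent=2, sort_keys=True)
print(text)
```

Writing `text` to a file next to the archived source and binaries gives you a baseline you can compare against when a node is re-tested later.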

Small Node Groups

After the single-node runs are done, I test small groups of nodes. You can either arbitrarily pick the number of nodes per group to test, or you can group the nodes together so that they all belong to a single switch. Generally, I try to run four nodes per group to keep things simple. In these groups, I run tests with both a single core per node and all the cores per node, allowing me to stress the nodes in different ways. The goal of small-node-group testing is to start introducing network performance as an overall parameter. For these runs, you have to use the MPI version of the NPB tests, and I would run the same tests as used in the single-node runs.
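Splitting a node list into groups and working out the two process counts per group takes only a few lines. This sketch assumes arbitrary grouping in list order (not grouping by switch), hypothetical node names, and eight cores per node as in Listing 1.

```python
def make_groups(nodes, group_size=4):
    """Split the node list into consecutive groups of group_size nodes."""
    return [nodes[i:i + group_size] for i in range(0, len(nodes), group_size)]

def process_counts(group, cores_per_node):
    """MPI process counts for the two run styles described above."""
    return {
        "one_core_per_node": len(group),
        "all_cores": len(group) * cores_per_node,
    }

nodes = [f"node{n:03d}" for n in range(1, 11)]  # hypothetical names
groups = make_groups(nodes)

for group in groups:
    print(",".join(group), process_counts(group, cores_per_node=8))
```

With 10 nodes the last group has only two members, so you would need to decide whether to run it as a smaller group or fold those nodes into another one. Some NPB MPI benchmarks also restrict the process counts they accept, so check the resulting numbers against the benchmark you plan to run.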

I recommend running two different classes for these small node groups, beginning with A or B, to stress the network by taking a small problem and spreading it across a number of processes. However, real systems are seldom run this way, because it is not an efficient use of the system. Therefore, I would also run the largest class problem possible to stress the memory, CPU, and network.

After running these tests, you again perform a statistical analysis on the results in exactly the same manner as described for the single-node runs: compute the average and standard deviation of the tests, look for outliers in the data, run more tests on those groups, and perhaps triage the nodes if needed. I would also recommend comparing the nodes in these outlier groups with the outliers in the single-node tests to look for correlation.
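The comparison with the single-node outliers amounts to a set intersection per group. The sketch below uses made-up group and node names; the `correlate` helper is an illustrative name, not an established tool.

```python
def correlate(group_outliers, single_node_outliers):
    """For each outlier group, list members that were also single-node outliers."""
    known_bad = set(single_node_outliers)
    return {
        group_name: sorted(set(members) & known_bad)
        for group_name, members in group_outliers.items()
    }

# Hypothetical inputs: two slow groups and the single-node outlier list.
group_outliers = {
    "group01": ["node001", "node002", "node003", "node004"],
    "group07": ["node025", "node026", "node027", "node028"],
}
single_node_outliers = ["node004", "node019"]

report = correlate(group_outliers, single_node_outliers)
print(report)
# prints: {'group01': ['node004'], 'group07': []}
```

A group whose slowdown overlaps with a known single-node outlier may simply be inheriting that node's problem. A group with no overlap, like the second one here, is a candidate for a closer look at the network, although that is only a starting hypothesis.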

As with the single-node tests, be sure to store the results somewhere you can easily retrieve them, along with the source and binaries and how you built the code, including versions.
