One of the key bottlenecks for HPC application performance is memory bandwidth: literally, how fast you can get data from memory to the processor and back. A convenient microbenchmark named Stream measures the memory bandwidth of nodes and reveals a general trend over the last six years that might surprise you.

Benchmarking Memory Bandwidth

After a while, everyone in HPC realizes that as technology changes, we're just moving performance bottlenecks to different places in the system as a whole. We'll never get a perfect system that is infinitely fast and infinitely scalable, with more memory than we can possibly use and that doesn't use much power, is inexpensive, and … well, you get the picture. Various aspects of the system will always have built-in limits. The secret to a good design is understanding your applications and knowing the performance of key aspects of the system.

 

One performance aspect that people have been yelling about for some time is memory bandwidth: how quickly data can be written to or read from memory by the processor. This driver of application performance is very important because it affects how quickly the OS can get data into and out of memory for processing. If memory bandwidth is low, then the processor could be waiting on memory to retrieve or write data. If memory bandwidth is high, then the data needed by the processor can easily be retrieved or written.

 

Although not every application has memory/data access patterns for which performance is driven by memory bandwidth, you might be surprised by the number of applications for which this is true. For example, a class of “unstructured” computational fluid dynamics (CFD) applications access data in an array in a non-sequential way [e.g., assume an array A that the CFD code accesses in a loop non-sequentially as A(8), A(143), A(32), A(56), A(10678), and A(2004)]. The performance of this application is driven by how quickly the values from A can be obtained (reading from memory). Once the computation is finished, a certain element of A might also be updated (writing to memory). All of this is done in a loop using all of the values in array A. One of the drivers of this computation will be how fast the data can be retrieved from and written to memory (i.e., memory bandwidth).
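The access pattern just described can be sketched in a few lines of Python. This is only an illustration: the array contents, the update rule, and the function name gather_update are assumptions made for the example and do not come from any real CFD code. Only the index order is taken from the text above.

```python
# Illustrative sketch of a non-sequential (indirect) access pattern.
# The update rule and the array contents are made up for illustration.

def gather_update(a, indices, scale=0.5):
    """Visit a[i] in the given (non-sequential) order, do a trivial
    computation on the value, and write the result back to a[i]."""
    for i in indices:
        value = a[i]          # read from memory at a non-contiguous location
        a[i] = value * scale  # write the updated value back to memory
    return a

a = [float(i) for i in range(20000)]
indices = [8, 143, 32, 56, 10678, 2004]  # access order from the example above
gather_update(a, indices)
```

Because consecutive accesses land far apart in the array, each one is likely to touch a different region of memory, which is why such loops tend to be limited by the memory system rather than by arithmetic.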

 

Stream Memory Bandwidth

 

One of the most commonly used benchmarks in all of HPC-dom is Stream, a synthetic benchmark that measures sustainable memory bandwidth for simple computational kernels. Table 1 lists the four benchmarks that compose Stream.

 

Table 1: Stream Benchmarks

 

Name     Kernel                   Bytes/Iteration   FLOPS/Iteration
COPY     a(i) = b(i)              16                0
SCALE    a(i) = q*b(i)            16                1
SUM      a(i) = b(i) + c(i)       24                1
TRIAD    a(i) = b(i) + q*c(i)     24                2
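To make the table concrete, here is a minimal sketch of the four kernels in plain Python. The real benchmark is written in C and Fortran, so this is for illustration only. The byte counts assume 8-byte (double-precision) array elements, and the names WORD_BYTES, KERNELS, and bytes_per_iteration are invented for this example.

```python
# Sketch of the four Stream kernels (illustrative only; not the benchmark code).
WORD_BYTES = 8  # assumption: 8-byte (double-precision) array elements

def copy(a, b):
    for i in range(len(a)):
        a[i] = b[i]

def scale(a, b, q):
    for i in range(len(a)):
        a[i] = q * b[i]

def add(a, b, c):  # the Sum kernel
    for i in range(len(a)):
        a[i] = b[i] + c[i]

def triad(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

# Number of arrays touched per iteration and FLOPs per iteration.
KERNELS = {
    "COPY":  {"arrays": 2, "flops": 0},
    "SCALE": {"arrays": 2, "flops": 1},
    "SUM":   {"arrays": 3, "flops": 1},
    "TRIAD": {"arrays": 3, "flops": 2},
}

def bytes_per_iteration(name):
    """Bytes moved per loop iteration: one word for each array touched."""
    return KERNELS[name]["arrays"] * WORD_BYTES

# Small usage example.
b = [1.0, 2.0, 3.0]
c = [10.0, 20.0, 30.0]
a = [0.0] * 3
triad(a, b, c, 2.0)
```

The bytes-per-iteration column is simply the number of arrays touched multiplied by the word size, which is why Copy and Scale count 16 bytes and Sum and Triad count 24.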

 

The Copy benchmark measures the transfer rate in the absence of arithmetic. This should be one of the fastest memory operations, but it also represents a common one: fetching two values from memory, a(i) and b(i), and updating one of them, a(i).

 

The Scale benchmark adds a simple arithmetic operation to the Copy benchmark. This starts to simulate real application operations. The operation fetches two values from memory, a(i) and b(i), but operates on b(i) before writing it to a(i). It's a simple scalar operation, but more complex operations are built from it, so the performance of this simple test can be used as an indicator of the performance of more complex operations.

 

The third benchmark, the Sum benchmark, adds a third operand and was originally written to allow multiple load/store ports on vector machines to be tested when vector machines were in vogue. However, this benchmark is very useful today because of the large pipelines that some processors possess. Rather than just fetch two values from memory, this micro-benchmark fetches three. For larger arrays, this will quickly fill a processor pipeline, so you can test the memory bandwidth filling the processor pipeline or the performance when the pipeline is full. Moreover, this benchmark is starting to approximate what some applications will perform in real computations.

 

The fourth benchmark in Stream, the Triad benchmark, allows chained, overlapped, or fused multiply-add operations. It builds on the Sum benchmark by adding an arithmetic operation to one of the fetched array values. Given that fused multiply-add (FMA) operations are important in many basic computations, such as dot products, matrix multiplication, polynomial evaluations, Newton’s method for evaluating functions, and many DSP operations, this benchmark can be directly associated with application performance. The FMA operation has its own instructions now and is usually done in hardware. Consequently, feeding such hardware operations with data can be extremely important – hence, the usefulness of the Triad memory bandwidth benchmark.

 

Which of the four memory bandwidth tests from Stream is most important? That's not an easy question to answer beyond the classic response of, "it depends." To find out for your application, you need to get "down and dirty" to determine which one, or more, of the memory bandwidth micro-benchmarks is a good indicator of performance. Honestly, I think it's difficult to do, and I've never been too successful with this approach.

 

I have seen people try to take the four results and create some sort of average or mean (e.g., a geometric mean). This information can be misleading because the four results can be skewed relative to one another, despite efforts to account for it via the geometric mean. Sometimes, people ask for all Stream results, as well as the geometric mean, with the idea of comparing the performance of various platforms with the use of a single metric. Personally, I like to see all four benchmark results, and when in doubt, I tend to focus just on the Triad result.

 

On the Stream website, you will see a couple of versions of the benchmark: a Fortran version from 2005 and a C version that was updated in 2013. The C program is typically the version used; you can run it as a serial application or threaded using OpenMP, usually using all of the cores on the node to allow you to see the best overall memory bandwidth. I won't discuss how to build Stream and run it because the web has plenty of OpenMP tutorials.

 

There are two variables or definitions in the code that you should pay attention to. The first is STREAM_ARRAY_SIZE. This is the number of array elements used to run the benchmarks. In the current version, it is set to 10,000,000, which the code states should be good enough for caches up to 20MB. The Stream FAQ recommends you use a problem size such that each array is four times the sum of the caches (L1, L2, and L3). You can either change the code to reflect the array sizes you want, or you can set the variable when compiling the code.
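As a rough check of the sizing rule, the following sketch computes the smallest element count for which one array is four times the total cache size. It assumes 8-byte array elements and 1MB = 10^6 bytes; the function name min_stream_array_size is made up for this example.

```python
def min_stream_array_size(total_cache_bytes, bytes_per_element=8, factor=4):
    """Smallest number of elements such that one array occupies at least
    `factor` times the total cache size (rounded up to a whole element)."""
    needed_bytes = factor * total_cache_bytes
    return (needed_bytes + bytes_per_element - 1) // bytes_per_element

# Worked example: 20MB of total cache (L1 + L2 + L3), with 1MB = 10**6 bytes.
cache_bytes = 20 * 10**6
n = min_stream_array_size(cache_bytes)
print(n)  # 10000000
```

Under these assumptions, 20MB of cache gives exactly the default of 10,000,000 elements, which matches the code's statement about caches up to 20MB. For a node with more cache, the same calculation gives the value to put in the code or to define at compile time.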

 

The second variable you might want to change is NTIMES, the number of times each benchmark is run. By default, Stream reports the "best" result for any iteration after the first; therefore, be sure always to set NTIMES at least to 2 (10 is the default). This variable can also be set during compilation without changing the code.
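The "best result after the first iteration" rule can be sketched as follows. The timings below are invented for illustration, the function name best_rate_mbps is hypothetical, and the result is assumed to be in MB/s with 1MB = 10^6 bytes.

```python
def best_rate_mbps(times_s, bytes_per_iteration, array_size):
    """Best bandwidth in MB/s (1MB = 10**6 bytes) over all timed runs
    except the first, which is discarded as a warm-up."""
    if len(times_s) < 2:
        raise ValueError("need at least two timed runs (NTIMES >= 2)")
    best_time = min(times_s[1:])  # ignore the first run
    total_bytes = bytes_per_iteration * array_size
    return 1.0e-6 * total_bytes / best_time

# Hypothetical Triad timings (seconds) for 10,000,000 elements, 24 bytes/iteration.
times = [0.050, 0.020, 0.021, 0.019]
rate = best_rate_mbps(times, 24, 10_000_000)
print(f"{rate:.1f} MB/s")
```

With only one run there is nothing left after the first is discarded, which is why NTIMES needs to be at least 2.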

 

The benchmark runs very fast unless you make STREAM_ARRAY_SIZE and NTIMES really large. Given the large number of cores on today's systems, you can try different core counts and see how the overall memory bandwidth varies. Another option, given that most processors have memory controllers in the processor, is to place all of the Stream processes (threads) on a single processor (socket), increasing the thread count until you reach the core count for that processor. This can tell you the best memory performance from the memory controller, which could be useful information.

 

HPC Time Machine

 

I think it's worthwhile to review memory bandwidth as measured by Stream over the last six years or so. I will be focusing on Intel processors because they have undergone quite a number of changes during this time; however, I also will throw in a couple of interesting results to illustrate the wide variation in memory bandwidth.

 

Table 2 lists the processor; the number of cores per node, assuming a dual-socket node; the number of cores per socket; the total Triad memory bandwidth found by using all the cores on the node; and the Triad memory bandwidth per core (total Triad memory bandwidth divided by the number of cores per node).

 

Table 2: Triad Memory Bandwidth (BW) Results – Total and per Core

 

Processor                 Cores/Node (Cores/Socket)   Total Memory BW (GBps)   Memory BW/Core (GBps)
Harpertown (2007)         8 (4)                       7.2                      0.9
Nehalem-EP (2009)         8 (4)                       32                       4
Westmere-EP (2010)        12 (6)                      42                       3.5
Westmere-EP (2010)        8 (4)                       42                       5.25
Sandy Bridge EP (2012)    16 (8)                      78                       4.88
Sandy Bridge EP (2012)    12 (6)                      78                       6.5
Sandy Bridge EP (2012)    8 (4)                       78                       9.75
Ivy Bridge EP (2013)      24 (12)                     101                      4.21
Ivy Bridge EP (2013)      20 (10)                     101                      5.05
Ivy Bridge EP (2013)      16 (8)                      101                      6.31
Ivy Bridge EP (2013)      12 (6)                      101                      8.42
Haswell EP (guess)        32 (16)                     120                      3.75
Haswell EP (guess)        24 (12)                     120                      5
Raspberry Pi v1           1                           0.25                     0.25
Raspberry Pi v2           1                           0.26                     0.26
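The per-core column is just the total Triad bandwidth divided by the number of cores per node. A small sketch that recomputes it for the maximum core count of each Intel generation is shown here; the data values are copied from Table 2, while the names RESULTS and bw_per_core are invented for the example.

```python
# (processor, cores per node, total Triad bandwidth in GBps), from Table 2.
RESULTS = [
    ("Harpertown (2007)",       8,   7.2),
    ("Nehalem-EP (2009)",       8,  32.0),
    ("Westmere-EP (2010)",     12,  42.0),
    ("Sandy Bridge EP (2012)", 16,  78.0),
    ("Ivy Bridge EP (2013)",   24, 101.0),
    ("Haswell EP (guess)",     32, 120.0),
]

def bw_per_core(total_gbps, cores_per_node):
    """Memory bandwidth per core: total bandwidth divided by core count."""
    return total_gbps / cores_per_node

for name, cores, total in RESULTS:
    print(f"{name:24s} {bw_per_core(total, cores):6.2f} GBps/core")

# Ratio of Nehalem to Harpertown per-core bandwidth.
nehalem_vs_harpertown = bw_per_core(32.0, 8) / bw_per_core(7.2, 8)
```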

 

 

Notice that some of the processors have more than one core count. Also notice that I included a couple of Raspberry Pi results at the very end. This provides a comparison at the low end of the memory bandwidth scale. The values for Haswell are taken from documents on the web. I have no idea whether they are accurate or not, but I include them for comparison.

 

What is interesting to me is the progression of memory bandwidth per core with time. The old Harpertown processors used a front-side bus (FSB), so the memory controller was not in the processor. (I believe the AMD Opteron was the first x86 processor to put the memory controller in the processor.) Using the old FSB, the memory bandwidth per core was only 0.9GBps/core. Compared to this, the Raspberry Pi v2 has about one fourth of that memory bandwidth. For US$49, I get about one fourth the per-core memory bandwidth of the old Harpertown processors. This sort of puts things into perspective, I think.

 

With the Nehalem processor, Intel put the memory controller in the processor, and you can see the huge jump in memory bandwidth. The per core memory bandwidth for Nehalem is 4.44 times better than Harpertown, reaching about 4.0GBps/core. I think this is one of the major reasons the Nehalem processor was so popular and successful.

 

If you track memory bandwidth per core using the maximum number of cores per socket, then the results look something like this:

 

  • 4.0 (Nehalem)
  • 3.5 (Westmere)
  • 4.875 (Sandy Bridge)
  • 4.210 (Ivy Bridge)
  • 3.75 (Estimated Haswell performance, assuming 16 cores per socket)

 

If you know Intel's Tick-Tock processor model, you know that the "tick" is a die shrink and the "tock" is a new processor design. During the tick, more cores are usually added, which gives more processing power but also reduces the memory bandwidth per core.

 

For example, going from Nehalem to Westmere, the number of cores per socket went from four to six (50% more cores), but the memory bandwidth per core dropped by 12.5 percent (4.0 to 3.5GBps). However, if you stayed with four cores per processor, the memory bandwidth per core actually went up to 5.25GBps/core. I know of some computer-aided engineering (CAE) applications for which the independent software vendor (ISV) recommends using only four cores if you have six-core processors. The reason is that the extra memory bandwidth per core improves application performance. However, this recommendation ideally needs to be viewed through the lens of cost per core or cost per run time for the application; then, one can make a determination as to whether this is a cost-effective solution.
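The Nehalem-to-Westmere arithmetic can be checked with a short sketch. It assumes, as Table 2 does, that the total node bandwidth stays the same when fewer cores are used; the function names are made up for this example.

```python
def per_core(total_gbps, cores_used):
    """Bandwidth available to each core when `cores_used` cores share the node."""
    return total_gbps / cores_used

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

nehalem = per_core(32.0, 8)          # 4 cores/socket, 2 sockets
westmere_full = per_core(42.0, 12)   # all 6 cores/socket in use
westmere_four = per_core(42.0, 8)    # only 4 of 6 cores/socket in use

print(percent_change(nehalem, westmere_full))  # -12.5
print(westmere_four)                           # 5.25
```

Leaving two cores per socket idle raises the per-core figure by half (3.5 to 5.25GBps) but wastes a third of the cores paid for, which is the cost trade-off mentioned above.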

 

The one aspect that concerns me is my estimated Haswell scenario. The rumors, and I have no idea if they are valid or not, are that memory speed will increase to 2,133MHz (DDR4 memory), and the number of memory channels will stay the same at four per socket. However, the core count is projected to be 15 or 16 cores per socket. If you assume no big changes in the memory controllers, then scaling the Ivy Bridge result by the ratio of memory speeds (2,133MHz for Haswell to 1,866MHz for Ivy Bridge), plus a little extra because Intel always gets a little more performance from the controllers, puts the total two-socket memory bandwidth at around 120GBps. That sounds wonderful until you realize that you can have up to 32 cores sharing that bandwidth. The result is only 3.75GBps/core for memory bandwidth, which is lower than the original Nehalem memory bandwidth per core. However, these are just estimates, and reality will answer all questions.
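The estimate above works out as follows. All inputs are the rumored or measured figures quoted in the text, and rounding up to 120GBps is the "little extra" assumed there, so treat the result as a back-of-the-envelope guess rather than a prediction.

```python
ivy_bridge_total = 101.0   # GBps, measured two-socket Triad result from Table 2
ivy_bridge_mhz = 1866.0    # Ivy Bridge memory speed
haswell_mhz = 2133.0       # rumored Haswell (DDR4) memory speed

# Scale the Ivy Bridge result by the ratio of memory speeds.
scaled_total = ivy_bridge_total * haswell_mhz / ivy_bridge_mhz
print(f"{scaled_total:.1f} GBps")  # roughly 115 GBps before the "little extra"

# Assumed total after allowing for some controller improvement.
haswell_total = 120.0

per_core_16 = haswell_total / (2 * 16)  # 16 cores per socket
per_core_15 = haswell_total / (2 * 15)  # 15 cores per socket
print(per_core_16, per_core_15)         # 3.75 4.0
```

With 15 cores per socket, the per-core figure only matches Nehalem's 4.0GBps; with 16, it falls below it.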

 

Summary

 

Memory bandwidth is one aspect of a system that people don't always think about. It's kind of there and you assume it's reasonable, but for some applications, performance is driven by memory bandwidth. In fact, a large number of applications need good memory bandwidth for good performance.

 

In the HPC world, the most common benchmark for measuring memory bandwidth is Stream. It has been around for a few years and is used quite often in RFPs or in simply understanding memory bandwidth of systems. Stream has four very simple micro-benchmarks that measure basic memory access patterns used in more complex patterns of real applications.

 

In the short history of memory bandwidth in Intel processors over the last six years, with close attention paid to memory bandwidth per core (because one likes to use all cores available), you could see that memory bandwidth per core varied with the "tick-tock" model. In general, the memory bandwidth trend is upward (i.e., more memory bandwidth per core over time), but not always. As a consequence, several ISV companies have recommended that their customers not use all of the cores on their processors, increasing the memory bandwidth per core used and presumably making overall performance better. Additionally, you should examine the effect on the overall performance per dollar or the performance per node to ascertain whether you have created a better system.