Lead Image © Lucy Baldwin, 123RF.com

Thrashing the data cache for fun and profit

Ducks in a Row

Article from ADMIN 54/2019
To reduce the risk of thrashing, optimize your data cache with spatial locality and cache-friendly algorithms, and make full use of the data loaded from slower storage media before moving on.

In the past [1], I discussed how instruction cache misses can completely ruin your CPU's day when it comes to execution speed by stalling processing for thousands or even millions of cycles while program text is being retrieved from slower caches, or even from (gosh!) RAM [2] [3]. Now the time has come to turn your attention to the data cache: Where large enough datasets are involved, spatial locality and cache-friendly algorithms are among the first optimizations you should employ.

Three Times Slower

Compare the code presented in Listings 1 and 2. You might use diff [4] to verify that the only difference is in the order of the i and j variables in the array lookup. Listing 1 increments 100 million integers, scanning the array row by row, whereas Listing 2 increments the 10,000 elements of a column before moving to the next column.

Listing 1

row.c

#include <stdio.h>
#define SIZE 10000

int array[SIZE][SIZE] = {0};

int main() {
  for (int i = 0; i < SIZE; i++)
    for (int j = 0; j < SIZE; j++)
      array[i][j]++;
  return 0;
}

Listing 2

column.c

#include <stdio.h>
#define SIZE 10000

int array[SIZE][SIZE] = {0};

int main() {
  for (int i = 0; i < SIZE; i++)
    for (int j = 0; j < SIZE; j++)
      array[j][i]++;
  return 0;
}

Both listings do exactly the same thing, proceeding in what is called row-major (Listing 1) or column-major (Listing 2) order [5]. To see how this plays out in practice, compile both programs:

gcc column.c -o column
gcc row.c -o row

Next, run and time the execution of these logically equivalent access strategies:

$ time ./row
real    0m1.388s
user    0m0.943s
sys     0m0.444s
$ time ./column
real    0m4.538s
user    0m4.068s
sys     0m0.468s

The experiment shows the column-major code taking more than three times as long to execute as its row-major counterpart. That significant result is partially explained by how C, C++, and other languages "flatten" two-dimensional arrays into the one-dimensional address space exposed by RAM.

As shown in Figure 1, individual rows line up sequentially in memory, something that can be easily shown with a simple program printing the dereferenced value of an incrementing pointer (Listing 3). This new program uses C pointer arithmetic (don't try this at home, as the MythBusters would say, because it is generally considered poor practice) to access and print every location in a small 2D array by linearly accessing memory. C really represents two-dimensional arrays as arrays of arrays, and lays them out in row-major fashion (i.e., row by row in memory).

Figure 1: Flattening a 2D array in C or C++.

Listing 3

inspect.c

#include <stdio.h>

int a[4][5] = {             // array of 4 arrays of 5 ints each, a 4x5 matrix
    { 1, 2, 3, 4, 5 },      // row 0 initialized to 1, 2, 3, 4, 5
    { 6, 7, 8, 9, 10 },     // row 1 initialized to 6, 7, 8, 9 ...
};                          // rows 2 onward initialized to zeros

int main() {
  int *p = &a[0][0];        // point at the first element of the flattened array
  for (int i = 0; i < 20; i++)
    printf("%d, ", *p++);
  printf("\b\b \n");        // erase the trailing comma and add a newline
  return 0;
}

Running the third program,

$ ./inspect
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$

indeed produces the output one would expect.

The Right Neighborhood

With every row of the array weighing in at roughly 40KB (10,000 entries of 32 bits each), the difference in access time reveals the nature of the underlying machine. Accessing the array in the natural order of the programming language (implementation choices differ here, with Fortran and MATLAB opting for column-major layouts) leverages the spatial locality of the access pattern, reading locations that are contiguous in physical memory.

Conversely, using the wrong strategy produces code constantly looking up memory locations that are 40KB apart. The first approach is cache-friendly, whereas the latter approach repeatedly evicts cache lines that were just loaded yet used to process but a single value. The more linear the memory access pattern, the easier it is for the CPU to be ready for operations, with data already in the cache when called.

Memory layout and access patterns matter, and a stark demonstration of the difference between the two approaches is provided by Figure 2. The perf stat command [6] executes the command requested while gathering statistics from the CPU's performance counters. Switching between the row-major and column-major access patterns resulted in a 20-fold increase in L1 cache misses, spiking from 13 million to 273 million!

Figure 2: CPU counters for data cache misses.

The Cachegrind module of valgrind [7] measures how a program interacts with an ideal cache hierarchy by simulating a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). Although Cachegrind requires tuning to match the size of a CPU's caches, the independence of its simulations from any other process running in the system, reproducible results, and compatibility with virtual environments make it the tool of choice when studying a cache performance issue.

Figure 3 shows the results of the run, producing a 16-fold increase in data cache misses, up from 6 million to 100 million events. Note how the row-major approach delivered a near-perfect result, with D1 cache misses coming in under one percent.

Figure 3: Simulated CPU cache miss rates.

Fully utilizing the data loaded from the slower storage medium (RAM in this case) is a common optimization pattern. Swapping to disk is, of course, the worst-case scenario, with 13,700 times the latency of the L1 cache. The situation in which memory pages are offloaded to the swap device after a single variable access and then reloaded to access the next location is known as disk thrashing. RAM is loaded by the CPU in cache lines [8], and data is loaded from disk in full pages; whenever possible, make full use of all their data before moving on to the next.

Infos

  1. "Heavy Rotation" by Federico Lucifredi, ADMIN, issue 14, 2013, pg. 88
  2. Visualization of L1, L2, RAM, and disk latencies: https://news.ycombinator.com/item?id=702713
  3. Interactive visualization of cache speeds: http://www.overbyte.com.au/misc/Lesson3/CacheFun.html
  4. diff (1) man page: https://manpages.ubuntu.com/manpages/bionic/en/man1/diff.1.html
  5. Row- and column-major order: https://en.wikipedia.org/wiki/Row-_and_column-major_order
  6. perf stat (1) man page: http://manpages.ubuntu.com/manpages/bionic/man1/perf-stat.1.html
  7. Cachegrind manual: http://valgrind.org/docs/manual/cg-manual.html
  8. CPU cache: https://en.wikipedia.org/wiki/CPU_cache

The Author

Federico Lucifredi (@0xf2) is the Product Management Director for Ceph Storage at Red Hat and was formerly the Ubuntu Server Project Manager at Canonical and the Linux "Systems Management Czar" at SUSE. He enjoys arcane hardware issues and shell-scripting mysteries and takes his McFlurry shaken, not stirred. You can read more from him in the new O'Reilly title AWS System Administration.
