Why Good Applications Don't Scale

Speed Limits

You just bought a new system with lots and lots of cores (e.g., a desktop with 64 cores or a server with 128 cores). Now that you have all of these cores, why not take advantage of them by parallelizing your code? Depending on your code and your skills, you have a number of paths to parallelization, but after some hard work profiling and lots of testing, your application is successfully parallelized, and it gives you the correct answers! Now comes the real proof: You start checking your application's performance as you add cores.

Suppose that running on a single core takes about three minutes (180 seconds) of wall clock time. Cautiously, but with lots of optimism, you run it on two cores. The wall clock time is just under two and a half minutes (144 seconds), which is 80 percent of the time on a single core. Success!

You are seeing parallel processing in action, and for some HPC enthusiasts, this is truly thrilling. After doing your "parallelization success celebration dance," you go for it and run it on four cores. The code runs in just over two minutes (126 seconds). This is 70 percent of the time on a single core. Maybe not as great as the jump from one to two cores, but it is still faster than on a single core. Now try eight cores (more is better, right?). This runs in just under two minutes (117 seconds), or about 65 percent of the single-core time. What?

Now it's time to go for broke and use 32 cores. This test takes about 110 seconds or about 61 percent of the single core time. Argh! You feel like Charlie Brown trying to kick the football when Lucy is holding it. Enough is enough: Try all 64 cores. The application takes 205 seconds. This is maddening! Why did the wall clock time go up? What's going on?
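To make the trend in these numbers easier to see, the timings quoted above can be put into a small table. The following Python sketch is only illustrative: the timings are the ones from the text, but the function name `scaling_table` is hypothetical, and the "speedup" (single-core time divided by N-core time) and "efficiency" (speedup divided by core count) columns use the common textbook definitions, which are an assumption here rather than something the text has defined.

```python
# Wall clock times in seconds from the text, keyed by number of cores.
timings = {1: 180, 2: 144, 4: 126, 8: 117, 32: 110, 64: 205}


def scaling_table(timings):
    """Return one row per core count:
    (cores, seconds, percent of single-core time, speedup, efficiency).

    Assumed definitions (not taken from the text):
      speedup    = single-core time / time on N cores
      efficiency = speedup / N
    """
    t1 = timings[1]
    rows = []
    for cores in sorted(timings):
        t = timings[cores]
        speedup = t1 / t
        efficiency = speedup / cores
        rows.append((cores, t, 100.0 * t / t1, speedup, efficiency))
    return rows


# Print a simple aligned table of the results.
print(f"{'cores':>5} {'time (s)':>9} {'% of 1 core':>12} {'speedup':>8} {'efficiency':>11}")
for cores, t, pct, speedup, eff in scaling_table(timings):
    print(f"{cores:>5} {t:>9} {pct:>11.0f}% {speedup:>8.2f} {eff:>11.1%}")
```

With these numbers, two cores give a speedup of only 1.25 rather than the ideal 2, and at 64 cores the speedup drops below 1, meaning the run is slower than on a single core.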