With directive coding, you annotate code with compiler directives to take advantage of parallelism or accelerators. The two primary standards are OpenACC and OpenMP.

 

Directive Coding

 

Ellie Arroway: You found the primer.
S. R. Hadden: Clever girl! Lights. … Pages and pages of data. Over 63 thousand in all, and on the perimeter of each …
Ellie Arroway: …alignment symbols, registration marks, but they don’t line up.
S. R. Hadden: They do, if you think like a Vegan. An alien intelligence is going to be more advanced. That means efficiency functioning on multiple levels and in multiple dimensions.
Ellie Arroway: Yes! Of course. Where is the primer?
S. R. Hadden: You'll see. Every three-dimensional page contains a piece of the primer; there it was all the time, staring you in the face. Buried within the message itself, is the key …
 Contact (1997)

I've always loved this interaction from the movie Contact. To me, it illustrates that you constantly have to think about all aspects of a problem and can't focus on just one thing too long. You have to spin the problem, turn it around, and turn it inside out to understand the problem and solve it – or at least take a step in that direction.

Two current trends in the HPC world are intersecting: using more than one core and using accelerators. Together they create lots of opportunities, but sometimes you need to turn a problem around and examine it from a different perspective to find a solution that comes from the intersection of the two.

In the HPC world, opportunities can mean better performance (it does stand for “high performance,” after all), easier or simpler programming, more portable code, or perhaps all three. To better understand the interaction, I’ll first examine the trend of helping coders get past using only a single core.

Using More than One Core

XSEDE is an organization, primarily of colleges and universities, that integrates resources and services, mostly around HPC, and makes them easier to use and share. XSEDE’s services and tools allow the various facilities to be federated and shared. You can see a list of the resources on their website. It's definitely not a small organization, having well over 12×10^15 floating-point operations per second (12 PFLOPS) of peak performance in aggregate.

During a panel session at the recent XSEDE conference, it was stated that 30% of the jobs run through XSEDE in 2012 used only a single core. According to another presentation, only 70% used 16 cores (about a single node). There has got to be an easy way to accelerate at least the single-core jobs so that they use a good percentage of the cores in a single node. Perhaps some simple “hints,” or directives, can be added to code to tell the compiler that it can create a binary that takes advantage of all of the computational capability available.
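To make the idea of a "hint" concrete, here is a minimal sketch in C that uses an OpenMP directive. The function name dot_product is hypothetical and chosen only for this illustration. The single pragma line is the only difference from the serial version: it tells the compiler that the loop iterations can be divided among threads and that the partial sums should be combined at the end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example function: dot product of two vectors of length n.
 * The pragma asks an OpenMP-enabled compiler to split the loop across
 * threads; reduction(+:sum) gives each thread a private partial sum and
 * adds the partial sums together when the loop finishes. */
double dot_product(const double *x, const double *y, int n)
{
    /* Sanity-check the arguments before touching the arrays. */
    assert(x != NULL && y != NULL && n >= 0);

    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }

    return sum;
}
```

When the code is built with OpenMP enabled (for example, with -fopenmp in GCC), the loop can run on all of the cores in the node. A compiler without OpenMP enabled ignores the pragma and produces the ordinary serial loop, so the same source works in both cases.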

Accelerators

At the same time, HPC has had an insatiable appetite for more performance. CPUs have evolved to include several tiers of cache from L1, to L2, to L3 (and even L4) before going to main memory. Current CPUs also have several cores per processor. Although CPU improvements have brought wonderful gains in performance, the desire for even more performance has been strong, leading to the adoption of co-processors, which take on some of the computational load to improve application performance. These co-processors, also referred to as “accelerators,” can take many shapes: GPUs, many-core processors like Intel’s Xeon Phi, digital signal processors (DSPs), and even field-programmable gate arrays (FPGAs).

All of this hardware has been added in the name of better performance. Meanwhile, applications and tools have evolved to take advantage of the extra hardware, with applications using OpenMP to utilize the hardware on a single node or MPI to take advantage of the extra processing power across independent nodes.

Certain applications or parts of applications can be rewritten to use these accelerators, greatly increasing their performance. However, each accelerator has its own unique properties, so applications have to be written for a specific accelerator. How about killing two birds with one technology? Is there a simple way to help people write code that uses multiple cores, accelerators, or both? Wouldn’t it be nice to have compiler directives that tell the compiler which sections of code can be built for an accelerator, including extra CPU cores, and then have it build that code for the targeted accelerator? It turns out a couple of directive standards are available.
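As a rough sketch of what marking a section of code for an accelerator can look like, the C function below uses an OpenACC directive. The function name saxpy and its parameter names are assumptions made for this illustration, not taken from any particular application. The pragma marks the loop as a candidate for offloading and describes which arrays must be copied to and from the accelerator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example function: y = a*x + y for vectors of length n.
 * The pragma asks an OpenACC-enabled compiler to run the loop on the
 * target accelerator. copyin(x[0:n]) copies x to the device only;
 * copy(y[0:n]) copies y to the device and back to the host afterward. */
void saxpy(int n, float a, const float *x, float *y)
{
    /* Sanity-check the arguments before touching the arrays. */
    assert(x != NULL && y != NULL && n >= 0);

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body itself contains nothing specific to one accelerator. A compiler with OpenACC enabled (for example, GCC with -fopenacc) decides how to build the loop for the chosen target, whereas a compiler without it ignores the pragma and builds a plain serial loop.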