Directive Coding

With directive coding, you annotate code with compiler directives to take advantage of parallelism or accelerators. The two primary standards are OpenACC and OpenMP.

 

Ellie Arroway: You found the primer.
S. R. Hadden: Clever girl! Lights. … Pages and pages of data. Over 63 thousand in all, and on the perimeter of each …
Ellie Arroway: …alignment symbols, registration marks, but they don’t line up.
S. R. Hadden: They do, if you think like a Vegan. An alien intelligence is going to be more advanced. That means efficiency functioning on multiple levels and in multiple dimensions.
Ellie Arroway: Yes! Of course. Where is the primer?
S. R. Hadden: You'll see. Every three-dimensional page contains a piece of the primer; there it was all the time, staring you in the face. Buried within the message itself, is the key …
 Contact (1997)

I've always loved this interaction from the movie Contact. To me, it illustrates that you constantly have to think about all aspects of a problem and can't focus on just one thing too long. You have to spin the problem, turn it around, and turn it inside out to understand the problem and solve it – or at least take a step in that direction.

Two current trends in the HPC world are intersecting: using more than one core and using accelerators. They can result in lots of opportunities, but sometimes you need to turn a problem around and examine it from a different perspective to find an approach to solving the problem that comes from the intersection of possible solutions.

In the HPC world, opportunities can mean faster performance (it does stand for “high performance,” after all), easier or simpler programming, more portable code, or perhaps all three. To better understand the interaction, I’ll examine the trend of helping coders get past using only a single core.

Using More than One Core

XSEDE is an organization, primarily of colleges and universities, that integrates resources and services, mostly around HPC, and makes them easier to use and share. XSEDE’s services and tools allow the various facilities to be federated and shared. You can see a list of the resources on their web site. It's definitely not a small organization, having well over 12×10^15 floating-point operations per second (12 PFLOPS) of peak performance in aggregate.

At the recent XSEDE conference, a panel session noted that 30% of the jobs run through XSEDE in 2012 used only a single core. Another presentation reported that only 70% of jobs used 16 cores (about a single node). There has to be an easy way to accelerate the single-core jobs so they use a good percentage of the cores in a single node. Perhaps some simple “hints” or directives can be added to code to tell the compiler that it can create a binary that takes advantage of all of the computational capability available.

Accelerators

At the same time, HPC has had an insatiable appetite for more performance. CPUs have evolved to include several tiers of cache from L1, to L2, to L3 (and even L4) before going to main memory. Current CPUs also have several cores per processor. Although CPU improvements have brought wonderful gains in performance, the desire for even more performance has been strong, leading to the adoption of co-processors, which take on some of the computational load to improve application performance. These co-processors, also referred to as “accelerators,” can take many shapes: GPUs, many-core processors like Intel’s Xeon Phi, digital signal processors (DSPs), and even field-programmable gate arrays (FPGAs).

All of this hardware has been added in the name of better performance. Meanwhile, applications and tools have evolved to take advantage of the extra hardware, with applications using OpenMP to utilize the hardware on a single node or MPI to take advantage of the extra processing power across independent nodes.

Certain applications or parts of applications can be re-written to use these accelerators, greatly increasing their performance. However, each accelerator has its own unique properties, so applications have to be written for that specific accelerator. How about killing two birds with one technology? Is it possible to have a simple way to help people write code that uses multiple cores or accelerators or both? Wouldn’t it be nice to have compiler directives that tell compilers what sections of code could be built for accelerators, including extra CPU cores, and build the code for the targeted accelerator? Turns out a couple of directives are available.

OpenACC

The first set of compiler directives is called OpenACC. It was started by Cray, CAPS, Nvidia, and PGI (Portland Group) and is somewhat similar to OpenMP in that you annotate your code with a series of comments that the compiler interprets as directives to build sections of code for the accelerator. OpenACC originally targeted GPUs from Nvidia and AMD (ATI), but it is being expanded to target other accelerators, possibly including other CPU cores (multicore processors) at some point.

OpenACC has two major versions: Version 1.0 of the standard, announced November 14, 2011, included a number of directives for coding accelerators (initially GPUs), and OpenACC compilers quickly became available. Version 2.0 was finalized in June 2013, although a preview was posted on November 12, 2012; it added new capabilities and expanded functionality based on the lessons learned since version 1.0 was released. Work is ongoing on the next OpenACC standard as well.

As mentioned previously, OpenACC is a set of directives you add to your code as comments. An OpenACC-aware compiler will interpret them as directives; a compiler that is not OpenACC ready will treat them as plain comments and ignore them. This makes it easy to create portable code that can be used with a variety of compilers.

The directives cover a fairly wide range of capability, telling the compiler to create code that can:

  • Initiate the accelerator startup/shutdown
  • Manage program and/or data transfer between the CPU and the accelerator (Note: at this time OpenACC assumes that the memory in the accelerator is distinct from the CPU, requiring data transfer.)
  • Manage the work between the accelerator and the CPU

On the basis of the directives, the compilers generate the best code possible, but it is up to the programmer to tune their code to take advantage of the accelerators.
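
To make the data-movement point concrete, here is a minimal sketch in C (my own toy example; the function and array names are not from any particular application). A data region copies the input arrays to the accelerator once, keeps them resident across two compute regions, and copies only the result back:

/* Hypothetical example: keep a, b, and c resident on the accelerator
   for the whole region so they are not transferred between the loops. */
void scale_and_add(float *a, float *b, float *c, int n)
{
    #pragma acc data copyin(b[0:n], c[0:n]) copyout(a[0:n])
    {
        /* First compute region: scale b on the accelerator. */
        #pragma acc kernels
        for (int i = 0; i < n; i++)
            b[i] = 2.0f * b[i];

        /* Second compute region: reuse b without another host transfer. */
        #pragma acc kernels
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }
}

Without the enclosing data region, each compute region would typically move its arrays to and from the accelerator on its own, which is exactly the kind of traffic the data directives let you avoid.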

Because accelerators target parallel computations, you can imagine that application code sections that target accelerators include both coarse-grained and fine-grained parallelism. Coarse-grained parallelism allows multiple executions in the accelerator at the same time, whereas fine-grained parallelism includes threads of execution within a single execution unit, such as SIMD and vector operations. Moreover, accelerators are good candidates for work-sharing loops, or “kernel” regions, wherein one or more loops are executed as kernels. (Generically, a kernel is a small section of code.)

The syntax for directives is pretty simple. For Fortran, the directive looks like:

!$acc directive [clause [, clause] ...]

In free-format Fortran (Fortran 90 and later), ! begins a comment. For C, the directive looks like:

#pragma acc directive [clause [, clause] ...]

OpenACC directives fall into several categories:

  • Accelerator parallel region/kernels directives
  • Loop directives
  • Data declaration directives
  • Data region directives
  • Cache directives
  • Wait/update directives
  • Environment variables

Although I won’t go through all of the directives and clauses, I’ll look at a couple to get a feel for what they look like and do.

The first directive or construct is a parallel construct. In Fortran, the code looks something like this:

!$acc parallel [clause [, clause] ...]
< structured code block >
!$acc end parallel

Notice that in Fortran you have to insert a directive that tells where the parallel region ends. In C, the code looks like this:

#pragma acc parallel [clause [, clause] ...]
< structured code block >

This directive tells the compiler to create code where gangs of worker threads are executing the “structured code block” in parallel on the accelerator. [Note: In CUDA, a “gang of workers” is a CUDA block of threads.] One worker in each gang begins executing the code in the structured block. The number of gangs and the number of workers in each gang remain constant for the duration of the parallel region.
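
To get a feel for what this looks like in practice, here is a minimal C sketch (my own toy example, not taken from the OpenACC specification); the clause values are arbitrary hints:

/* Illustrative only: distribute the loop iterations across gangs of
   workers on the accelerator. num_gangs and vector_length are optional
   tuning hints; the numbers here are arbitrary. */
void vec_scale(float *x, int n)
{
    #pragma acc parallel loop num_gangs(32) vector_length(128)
    for (int i = 0; i < n; i++)
        x[i] = 2.0f * x[i];
}

Here the loop directive is combined with parallel so the iterations are shared among the gangs rather than executed redundantly by each one. With an OpenACC-capable compiler the loop is built for the accelerator; any other compiler simply ignores the pragma and produces a normal serial loop.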

The second directive is a kernels directive. In Fortran, the directive looks like this:

!$acc kernels [clause [, clause] ... ]
< structured code block >
!$acc end kernels

In C, the same thing looks like the following:

#pragma acc kernels [clause [, clause] ...]
< structured code block >

The kernels directive tells the compiler that the structured code block has a region that can be compiled into a sequence of kernels for execution on the accelerator. It is similar to the parallel directive, but the loops in the region become independent kernels rather than one large kernel. These independent kernels and their associated data transfers may overlap with other kernels.

A simple example of this directive (in Fortran) is:

!$acc kernels
    do i=1,n
       a(i) = 0.0
       b(i) = 1.0
       c(i) = 2.0
    end do
 
    do i=1,n
       a(i) = b(i) + c(i)
    end do
!$acc end kernels

With the simple kernels directive, the compiler creates a kernel from the first loop and a second kernel from the second loop (i.e., they are independent). These kernels can then be run on the accelerator.
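
For comparison, here is a rough C equivalent of the same two loops (a sketch only; I've wrapped it in a function with hypothetical names so it stands alone):

/* C sketch of the kernels region above; names and sizes are illustrative. */
void init_and_add(float *a, float *b, float *c, int n)
{
    #pragma acc kernels
    {
        /* First kernel: initialize the arrays. */
        for (int i = 0; i < n; i++) {
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }

        /* Second kernel: element-wise sum. */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }
}

As in the Fortran version, the compiler is free to turn each loop into its own kernel.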

You can find a lot of introductory materials on the web about OpenACC, as well as some YouTube videos that walk through OpenACC and examples.

OpenMP

OpenMP was the first set of directives developed that helped the compiler find regions of code that could be run in parallel on shared memory systems. The last bit, shared memory systems, is important. OpenACC handles data movement to/from accelerators that may have their own memory, whereas OpenMP has to use shared memory. Today's multicore and multisocket systems are shared memory, so that’s usually not an issue.

OpenMP started in 1997 with version 1.0 for Fortran. Version 4.0 of the specification was released in July 2013 and includes some directives that allow for the use of accelerators. The non-profit OpenMP consortium, which manages OpenMP, is also working on new directives that work with accelerators. Remember that in the future OpenACC may be targeting CPUs as well as accelerators.

Because OpenMP is targeting shared memory, it can use threads, which are created by the “master” thread and forked to run on different processors, thereby running certain portions of the code in parallel. By default, each thread executes its section of code independently; therefore, you can create “work sharing” by dividing a task among threads so that each thread can run its portion of the code. In this way, you can create both task and data parallelism.

OpenMP provides several directives and related features:

  • PARALLEL
  • DO/PARALLEL DO and SECTIONS (primarily for work sharing)
  • SHARED and PRIVATE clauses for sharing data (or not) between threads
  • CRITICAL, ATOMIC, and BARRIER directives that coordinate and synchronize threads
  • Run-time functions and environment variables (not directives, but functions that OpenMP makes available).

The form of the directives is very similar to OpenACC. For C, the directives look like this:

#pragma omp construct [clause [clause] ...]

For free-format Fortran, the directives look like the following:

!$omp construct [clause [, clause] ...]

A simple Fortran example of the PARALLEL directive is as follows:

    program hello
    implicit none
 
!$OMP PARALLEL
    write(*,*)'hello world'
!$OMP END PARALLEL
 
    stop
    end

The PARALLEL directive tells the compiler to create code so that the write(*,*) statement is executed by each thread. You control the number of threads with an environment variable, OMP_NUM_THREADS, which you set to the number of threads you want.
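
If you want to see which thread is doing what, the OpenMP run-time functions can help. Here is a small C sketch of my own (the same idea as the Fortran hello world, plus the thread ID):

#include <stdio.h>
#include <omp.h>   /* omp_get_thread_num(), omp_get_num_threads() */

int main(void)
{
    /* Each thread in the parallel region prints its own ID; the thread
       count is normally taken from OMP_NUM_THREADS. */
    #pragma omp parallel
    {
        printf("hello world from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

With OMP_NUM_THREADS set to 4, you would see four lines of output, in whatever order the threads happen to reach the printf.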

OpenMP also has a concept of work sharing using the DO directive, which specifies that the iterations of the loop immediately following the directive must be executed in parallel. This assumes that a parallel region has been initiated with the PARALLEL directive. The DO construct can get a little complicated, but I'll show a simple example that adds two vectors together and stores the result in a third vector.

    program Vec_Add
    integer n, chunksize, chunk, i
    parameter (n=1000)
    parameter (chunksize=100)
    real :: A(n), B(n), C(n)
 
! Some initializations
    do i = 1, n
       A(i) = i * 1.0
       B(i) = A(i)
    enddo
 
    chunk = chunksize
!$OMP PARALLEL SHARED(A,B,C,chunk) PRIVATE(i)
!$OMP DO SCHEDULE(DYNAMIC,chunk)
    do i = 1, n
       C(i) = A(i) + B(i)
    enddo
!$OMP END DO
!$OMP END PARALLEL
 
    end program

The OpenMP portion of the code creates a parallel region and indicates that the variables A, B, C, and chunk are shared between the threads but the loop variable i is specific to each thread (i.e., each thread has its own copy). Next is the DO directive, which tells the compiler that the iterations of the loop will be distributed dynamically in chunk-sized pieces.
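
For comparison, here is a rough C sketch of the same vector add (again, a toy example with names of my own choosing) using the combined parallel for form:

#include <stdio.h>

#define N 1000

int main(void)
{
    float a[N], b[N], c[N];
    int chunk = 100;
    int i;

    /* Some initializations */
    for (i = 0; i < N; i++) {
        a[i] = i * 1.0f;
        b[i] = a[i];
    }

    /* Combined parallel region + work-sharing loop; iterations are handed
       out dynamically in chunk-sized pieces, as in the Fortran version. */
    #pragma omp parallel for shared(a, b, c, chunk) private(i) schedule(dynamic, chunk)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}

The combined form collapses the PARALLEL and DO pair into a single directive; otherwise the behavior matches the Fortran code above.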

As with OpenACC, you can find many tutorials and examples on the web, including YouTube. I encourage you to take a look at them, because the reward is the possibility of greatly reducing the run time of your application.

Parting Comments

It’s very interesting that 30% of the jobs at XSEDE, an organization that supplies HPC to university researchers, would only use a single core – probably because the researchers have never been taught the concepts of parallel programming or how easy it is to achieve. I have to admit that I’m guilty of writing single-thread (serial) code when I need something quick and dirty. However, if the code takes a fairly long time to run or if I have to run it many times, I will reach for compiler directives to parallelize the code easily.

At the same time, in an effort to get more speed from HPC systems, accelerators (co-processors) are becoming more and more prevalent. GPUs, DSPs, many-core processors (Phi), and FPGAs are all being used to improve performance. Coding for these isn't always easy, but the directive-based approach of OpenACC allows you to code for accelerators more easily. Right now, OpenACC is focused on GPUs, but the promise is there for it to abstract other accelerators.

A rule of thumb that you can use for the moment is that if you want to parallelize the code on a single system that has multiple cores, then OpenMP is your likely tool. If you want to use GPUs, then OpenACC is your friend. You can combine both of them in the same code if you desire.
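
To show what "combining" can look like, here is a final hedged sketch in C (entirely my own construction): OpenACC on a loop you might offload to a GPU and OpenMP on a loop you want spread across the host cores. A compiler that supports only one of the two simply ignores the other set of pragmas.

#include <stdio.h>

#define N 100000

static float x[N], y[N];

int main(void)
{
    /* OpenACC: offload this loop to an accelerator if one is available. */
    #pragma acc parallel loop copyout(x[0:N])
    for (int i = 0; i < N; i++)
        x[i] = 2.0f * i;

    /* OpenMP: run this loop across the host cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = x[i] + 1.0f;

    printf("y[%d] = %f\n", N - 1, y[N - 1]);
    return 0;
}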

The GNU set of compilers has had OpenMP capability for some time. In version 5.1 of the compilers, OpenACC support was added as well, although I'm not sure whether all of OpenACC 2.0 is supported. Now you have the opportunity to try directive programming for multicore processors, for GPUs, or for both.

One last comment: I’m sure people are reading this and thinking, “Why haven’t OpenACC and OpenMP merged?” My answer is that I don't know, but I do know the two organizations talk to each other. Maybe someday they will create one set of directives.