Directive Coding

OpenACC

The first set of compiler directives is called OpenACC. It was started by Cray, CAPS, Nvidia, and PGI (Portland Group) and is somewhat similar to OpenMP in that you annotate your code with a series of comments that the compiler interprets as directives to build sections of code for the accelerator. OpenACC originally targeted GPUs from Nvidia and AMD (ATI), but it is being expanded to target other accelerators, possibly including other CPU cores (multicore processors) at some point.

OpenACC has two major versions. Version 1.0 of the standard, announced November 14, 2011, included a number of directives for coding accelerators (initially GPUs), and OpenACC compilers quickly became available. Version 2.0 was finalized in June 2013, although a preview was posted on November 12, 2012. It added new capabilities and expanded functionality based on what had been learned since version 1.0 was released. Work is ongoing for the next OpenACC standard, as well.

As mentioned previously, OpenACC is a set of directives you add to your code as comments. An OpenACC compiler will interpret these as directives, but if the compiler is not OpenACC ready, it will think they are just comments and ignore them. This makes it easy to create portable code that can be used with a variety of compilers.
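
To see this portability in C, where the directives are pragmas rather than comments, here is a minimal sketch. The function names scale_vector and built_with_openacc are made up for illustration. The _OPENACC macro is defined by compilers that have OpenACC enabled, so the same source can report how it was built.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply every element of x by factor. With an OpenACC compiler the loop
   is offloaded; any other compiler skips the pragma and runs it serially. */
void scale_vector(int n, float factor, float *x)
{
    assert(n >= 0 && x != NULL);   /* simple precondition check */

#pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++) {
        x[i] = factor * x[i];
    }
}

/* Returns 1 if this file was compiled with OpenACC enabled, 0 otherwise. */
int built_with_openacc(void)
{
#ifdef _OPENACC
    return 1;
#else
    return 0;
#endif
}
```

With GCC, for example, building with -fopenacc enables the directives and building without it ignores them. The values in x should be the same either way.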

The directives cover a fairly large range of capability that tells compilers to create code that does a wide range of tasks, including:

  • Initiate the accelerator startup/shutdown
  • Manage program and/or data transfer between the CPU and the accelerator (Note: at this time OpenACC assumes that the memory in the accelerator is distinct from the CPU, requiring data transfer.)
  • Manage the work between the accelerator and the CPU
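
As a rough illustration of the data transfer item, the following C sketch wraps two loops in a single data region so that the arrays move to the accelerator once and back once. The function name add_then_double is hypothetical, and the choice of copyin, copyout, and present clauses is one reasonable arrangement, not the only one.

```c
#include <assert.h>
#include <stddef.h>

/* Computes c = 2 * (a + b) as two loops that share one data region. */
void add_then_double(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0 && a != NULL && b != NULL && c != NULL);

    /* copyin:  host -> accelerator when the region starts.
       copyout: accelerator -> host when the region ends. */
#pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        /* present: the arrays are already on the accelerator. */
#pragma acc parallel loop present(a[0:n], b[0:n], c[0:n])
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }

#pragma acc parallel loop present(c[0:n])
        for (int i = 0; i < n; i++) {
            c[i] = 2.0f * c[i];
        }
    }
}
```

Without the enclosing data region, each loop would need its own data clauses, and c would likely travel between the CPU and the accelerator twice.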

On the basis of the directives, the compilers generate the best code possible, but it is up to the programmer to tune their code to take advantage of the accelerators.

Because accelerators target parallel computations, you can imagine that application code sections that target accelerators include both coarse-grained and fine-grained parallelism. Coarse-grained parallelism allows multiple executions in the accelerator at the same time, whereas fine-grained parallelism includes threads of execution within a single execution unit, such as SIMD and vector operations. Moreover, accelerators are good candidates for work-sharing loops, or “kernel” regions, wherein one or more loops are executed as kernels. (Generically, a kernel is a small section of code.)

The syntax for directives is pretty simple. For Fortran, the directive looks like:

!$acc directive [clause [, clause] ...]

In Fortran (from free-format Fortran90 onward), ! is a comment. For C, the directive looks like:

#pragma acc directive [clause [, clause] ...]

OpenACC directives fall into several categories:

  • Accelerator parallel region/kernels directives
  • Loop directives
  • Data declaration directives
  • Data region directives
  • Cache directives
  • Wait/update directives
  • Environment variables

Although I won’t go through all of the directives and clauses, I’ll look at a couple to get a feel for what they look like and do.

The first directive or construct is a parallel construct. In Fortran, the code looks something like this:

!$acc parallel [clause [, clause] ...]
< structured code block >
!$acc end parallel

Notice that in Fortran you have to insert a directive that tells where the parallel region ends. In C, the code looks like this:

#pragma acc parallel [clause [, clause] ...]
< structured code block >

This directive tells the compiler to create code where gangs of worker threads are executing the “structured code block” in parallel on the accelerator. [Note: In CUDA, a “gang of workers” is a CUDA block of threads.] One worker in each gang begins executing the code in the structured block. The number of gangs and the number of workers in each gang remain constant for the duration of the parallel region.
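
A small C sketch of a parallel region might look like the following. The function name saxpy_acc and the gang count of 16 are arbitrary choices for illustration. The loop directive inside the region is what divides the iterations among the gangs; as I understand the standard, code in a parallel region outside a loop directive is executed redundantly by every gang.

```c
#include <assert.h>
#include <stddef.h>

/* y = alpha * x + y, executed in one OpenACC parallel region. */
void saxpy_acc(int n, float alpha, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    /* num_gangs fixes the number of gangs for the whole region
       (16 is an arbitrary value chosen for this sketch). */
#pragma acc parallel num_gangs(16) copyin(x[0:n]) copy(y[0:n])
    {
        /* Share the loop iterations across the gangs. */
#pragma acc loop gang
        for (int i = 0; i < n; i++) {
            y[i] = alpha * x[i] + y[i];
        }
    }
}
```

In practice, you would often leave out num_gangs and let the compiler and run time pick a value, then tune it only if profiling suggests a problem.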

The second directive is a kernels directive. In Fortran, the directive looks like this:

!$acc kernels [clause [, clause] ...]
< structured code block >
!$acc end kernels

In C, the same thing looks like the following:

#pragma acc kernels [clause [, clause] ...]
< structured code block >

The kernels directive tells the compiler that the structured code block has a region that can be compiled into a sequence of kernels for execution on the accelerator. It is similar to the parallel directive, but each loop in the kernels region becomes its own kernel rather than part of one large kernel. These independent kernels and associated data transfers may overlap with other kernels.

A simple example of this directive (in Fortran) is:

!$acc kernels
    do i=1,n
       a(i) = 0.0
       b(i) = 1.0
       c(i) = 2.0
    end do
 
    do i=1,n
       a(i) = b(i) + c(i)
    end do
!$acc end kernels

With the simple kernels directive, the compiler creates a kernel from the first loop and a second kernel from the second loop (i.e., they are independent). These kernels can then be run on the accelerator.
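
For comparison, a C sketch of the same two loops follows. The function name init_and_add is made up, and the restrict qualifiers and the copyout clause are my additions: in C the compiler usually needs some assurance that the pointers do not overlap, and it cannot work out the array sizes from bare pointers.

```c
#include <assert.h>
#include <stddef.h>

/* C counterpart of the Fortran kernels example: fill b and c, then
   store their sum in a. The three arrays must not overlap. */
void init_and_add(int n, float *restrict a, float *restrict b,
                  float *restrict c)
{
    assert(n >= 0 && a != NULL && b != NULL && c != NULL);

    /* All three arrays are written on the accelerator and copied back. */
#pragma acc kernels copyout(a[0:n], b[0:n], c[0:n])
    {
        /* First loop: candidate for the first kernel. */
        for (int i = 0; i < n; i++) {
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }

        /* Second loop: candidate for the second kernel. */
        for (int i = 0; i < n; i++) {
            a[i] = b[i] + c[i];
        }
    }
}
```

Compilers typically have an option that reports which loops were turned into kernels, which is a useful check that the directive did what you expected.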

You can find a lot of introductory materials on the web about OpenACC, as well as some YouTube videos that walk through OpenACC and examples.

OpenMP

OpenMP was the first set of directives developed that helped the compiler find regions of code that could be run in parallel on shared memory systems. The last bit, shared memory systems, is important. OpenACC handles data movement to/from accelerators that may have their own memory, whereas OpenMP has to use shared memory. Today's multicore and multisocket systems are shared memory, so that’s usually not an issue.

OpenMP started in 1997 with version 1.0 for Fortran. Version 4.0 of the specification was released in July 2013 and has some directives that allow the use of accelerators. The non-profit OpenMP consortium, which manages OpenMP, is also working on new directives that work with accelerators. Remember that in the future OpenACC may be targeting CPUs as well as accelerators.

Because OpenMP is targeting shared memory, it can use threads, which are created by the “master” thread and forked to run on different processors, thereby running certain portions of the code in parallel. By default, each thread executes its section of code independently; therefore, you can create “work sharing” by dividing a task among threads so that each thread can run its portion of the code. In this way, you can create both task and data parallelism.

OpenMP uses several directives:

  • PARALLEL
  • DO/PARALLEL DO and SECTIONS (primarily for work sharing)
  • SHARED and PRIVATE clauses for sharing data (or not) between threads
  • CRITICAL, ATOMIC, and BARRIER directives that coordinate and synchronize threads
  • Run-time functions and environment variables (not directives, but functions that OpenMP makes available).
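
To give a feel for how a few of these fit together, here is a small C sketch that uses the SHARED and PRIVATE clauses along with the ATOMIC and CRITICAL directives. The function name count_and_max and its parameters are invented for this example, and a real code would probably use a reduction instead, which is usually faster.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the entries of v greater than threshold and finds the largest
   entry. Both results are shared, so every update is protected. */
void count_and_max(int n, const int *v, int threshold,
                   int *count, int *max)
{
    assert(n > 0 && v != NULL && count != NULL && max != NULL);

    int above = 0;        /* shared counter */
    int largest = v[0];   /* shared running maximum */
    int i;

#pragma omp parallel for shared(v, above, largest) private(i)
    for (i = 0; i < n; i++) {
        if (v[i] > threshold) {
            /* ATOMIC protects a single memory update. */
#pragma omp atomic
            above++;
        }

        /* CRITICAL lets one thread at a time run the whole block. */
#pragma omp critical
        {
            if (v[i] > largest) {
                largest = v[i];
            }
        }
    }
    /* Threads wait at an implicit barrier at the end of the loop. */

    *count = above;
    *max = largest;
}
```

Built without OpenMP support, the pragmas are ignored and the loop runs on a single thread with the same results.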

The form of the directives is very similar to OpenACC. For C, the directives look like this:

#pragma omp construct [clause [clause] ...]

For free-format Fortran, the directives look like the following:

!$omp construct [clause [, clause] ...]

A simple Fortran example of the PARALLEL directive is as follows:

    program hello
    implicit none
 
!$OMP PARALLEL
    write(*,*)'hello world'
!$OMP END PARALLEL
 
    stop
    end

The PARALLEL directives tell the compiler to create code so that the write(*,*) statement is executed by each thread. You control the number of threads with an environment variable, OMP_NUM_THREADS, that you set to the number of threads you want.
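
A C sketch of the same idea follows. The function name say_hello is hypothetical, and returning the number of greetings is my own addition so that the effect of OMP_NUM_THREADS is easy to check. The _OPENMP macro and omp_get_thread_num() come from OpenMP itself.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Every thread in the parallel region prints a greeting.
   Returns how many greetings were printed (one per thread). */
int say_hello(void)
{
    int greetings = 0;

#pragma omp parallel
    {
        int tid = 0;   /* declared inside the region, so private */
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        printf("hello world from thread %d\n", tid);

        /* The counter is shared, so protect the update. */
#pragma omp atomic
        greetings++;
    }

    assert(greetings >= 1);
    return greetings;
}
```

If you build this with OpenMP enabled (for example, -fopenmp with GCC) and set OMP_NUM_THREADS to 4, you would normally expect four lines of output, in no particular order.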

OpenMP also has a concept of work sharing using the DO directive, which specifies that iterations of the loop immediately following the directive must be executed in parallel. This assumes that a parallel region has been initiated with the PARALLEL directive. The DO construct can get a little complicated, but I'll show a simple example that adds two vectors together and stores the result in a third vector.

    program Vec_Add
    integer n, chunksize, chunk, i
    parameter (n=1000)
    parameter (chunksize=100)
    real :: A(n), B(n), C(n)
 
! Some initializations
    do i = 1, n
       A(i) = i * 1.0
       B(i) = A(i)
    enddo
 
    chunk = chunksize
!$OMP PARALLEL SHARED(A,B,C,chunk) PRIVATE(i)
!$OMP DO SCHEDULE(DYNAMIC,chunk)
    do i = 1, n
       C(i) = A(i) + B(i)
    enddo
!$OMP END DO
!$OMP END PARALLEL
 
    end program

The OpenMP portion of the code creates a parallel region and indicates that the variables A, B, C, and chunk are shared between the threads but the loop variable i is specific to each thread (i.e., each thread has its own copy). Next is the DO directive, which tells the compiler that the iterations of the loop will be distributed dynamically among the threads in chunk-sized sections.
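
The same pattern can be sketched in C, where the work-sharing directive is called for rather than DO. The function name vec_add and its argument list are made up for this example. The clauses mirror the Fortran version.

```c
#include <assert.h>
#include <stddef.h>

/* c = a + b, with loop iterations handed out dynamically to threads
   in blocks of chunk iterations. */
void vec_add(int n, int chunk, const float *a, const float *b, float *c)
{
    assert(n >= 0 && chunk > 0);
    assert(a != NULL && b != NULL && c != NULL);

    int i;

    /* The arrays and chunk are shared; each thread gets its own i. */
#pragma omp parallel shared(a, b, c, chunk) private(i)
    {
        /* A thread that finishes its chunk grabs the next free one. */
#pragma omp for schedule(dynamic, chunk)
        for (i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }
}
```

Calling vec_add(1000, 100, a, b, c) matches the sizes in the Fortran program: the 1,000 iterations are split into 10 chunks of 100.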

As with OpenACC, you can find many tutorials and examples on the web, including YouTube. I encourage you to take a look at them, because the reward is the possibility of greatly reducing the run time of your application.