Porting Code to OpenACC

OpenACC directives can improve performance if you know how to find where parallel code will make the greatest difference.

In previous articles, I talked about how OpenACC can help you parallelize your code and gave a few simple examples of how to use OpenACC directives, but I didn’t discuss how to go about porting your code. In this article, I present one approach to porting code.

Checker

When porting code to OpenACC, or really any new language, you need to be able to check that the output is correct. One way is to have the code itself check the output, which can involve comparing the results against a known good solution stored within the application or that can easily be computed by the application. Another way is to use an external output checker that takes the output from the application and checks it against known, good output. This can be as simple as reading two data files and comparing the results.

Output from a CPU-only run and a CPU+GPU run can differ by a small amount because of how the arithmetic is carried out. For example, reductions that sum floating-point numbers in a different order because of parallelism can yield slightly different results. FMA (fused multiply-add) instructions are another source of differences: They perform the multiply and the add in a single step with one rounding, rather than as two separately rounded operations.

As a result, you should not perform a bit-for-bit comparison of the ported code’s output against the “correct” answer (i.e., don't use something like md5sum), because you will find differences that are not necessarily indicative of incorrect code.

Instead, you should look at differences relative to the data values. For example, if the difference between two numbers is 100.0, but you are working with values around 10^8, then the relative difference (0.0001%) might not be important. It’s really up to the user to decide whether the difference is significant, but a difference of this kind is likely not important, and the answer can be considered “correct.”

Comparing matrices is another challenge, and a quick online search turns up a number of ways to compare two of them. As a brute force method, you could subtract the (i,j) entries of the “new” output from those of the known “good” output and take the largest absolute difference (a single scalar value). Next, search through the known good matrix for the largest absolute value and compute the largest percent difference: (largest difference)/(largest value). Although this approach has no rigorous mathematical underpinnings, it lets you bound the largest possible difference. How you compare two matrices is up to you, but be sure that you understand what you are computing and what the output means.
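
As a minimal sketch of this brute force comparison (the function name, array names, and dimensions are placeholders, not part of any particular application), the following C routine returns the largest elementwise difference divided by the largest absolute value in the known good matrix:

#include <math.h>

/* Compare an n x m "new" matrix against a known good matrix.
 * Returns the largest elementwise difference divided by the
 * largest absolute value in the known good matrix.           */
double matrix_reldiff(int n, int m,
                      const double good[n][m], const double newer[n][m])
{
   double max_diff = 0.0, max_val = 0.0;

   for (int i = 0; i < n; i++) {
      for (int j = 0; j < m; j++) {
         double d = fabs(newer[i][j] - good[i][j]);
         double v = fabs(good[i][j]);
         if (d > max_diff) max_diff = d;
         if (v > max_val)  max_val  = v;
      }
   }
   /* Guard against an all-zero reference matrix; the caller
    * decides what counts as "close enough."                  */
   return (max_val > 0.0) ? max_diff / max_val : max_diff;
}

What counts as “close enough” depends on the precision and the amount of arithmetic in your code; as described above, it is ultimately a judgment call.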

Profiling

As you port your application to use OpenACC, understanding the effect of directives on performance is key. The first and arguably the easiest measure of performance is wall clock time for the entire application (start to finish). The second is the profile of the application, or the amount of time spent in each routine in the code.

Note that “profile” here does not mean a time history but is simply a list of the sum total of time spent in each routine of the application. If you order these times from the routine with the most time to the routine with the least time, you quickly see which routines take the most time, which suggests an “attack” plan for porting. By focusing on the routines that take the most time, you can use OpenACC directives to reduce the run time of that routine and thus reduce the overall application run time.

One way to profile your application is to instrument it to measure the wall clock time of each routine. If a routine is called repeatedly, you will need to sum the times across calls. A second way is to use the profiling tools that come with the compiler. The exact details depend on the specific compiler you are using. If you are using GCC, you can find some resources online. If you are using the PGI compiler, you can use pgprof at the command line. An online tutorial explains how to use pgprof with OpenACC applications. For this article, I use the Community Edition of the PGI compilers.
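
As a minimal sketch of manual instrumentation (the routine compute_step, the array, and the iteration count are hypothetical placeholders), you can wrap each call in a wall clock timer and accumulate the elapsed time:

#include <stdio.h>
#include <time.h>

/* Wall clock time in seconds (monotonic clock). */
static double wall_time(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return ts.tv_sec + ts.tv_nsec * 1.0e-9;
}

/* Hypothetical routine standing in for real work. */
static void compute_step(double *a, int n)
{
   for (int i = 0; i < n; i++)
      a[i] = a[i] * 1.0001 + 1.0;
}

int main(void)
{
   enum { N = 1000000, STEPS = 100 };
   static double a[N];
   double total = 0.0;              /* accumulated time in compute_step */

   for (int s = 0; s < STEPS; s++) {
      double t0 = wall_time();
      compute_step(a, N);
      total += wall_time() - t0;    /* sum over repeated calls */
   }
   printf("compute_step total: %.3f s\n", total);
   return 0;
}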

Assume you have an application you want to port to OpenACC. The first thing to do is profile it and create a stack rank of the routines with the most run time. Compile your code normally (no extra switches are needed for pgprof), then run the code under the profiler with the command:

$ pgprof [options] ./exe [application arguments]

In this case, the executable is named exe for simplicity. For a first profile, I use the options:

$ pgprof --cpu-profiling-thread-mode separated --cpu-profiling-mode top-down -o my.prof ./exe

The first option separates the output by thread, which is really useful if the code uses MPI, which often spawns extra threads. The second option lists the routines that use the most time first and the routines that use the least time last, which is referred to as a “stack rank.” The third option, -o my.prof, redirects the profile data to an output file; then, you can rerun the analysis anytime you want with the simple command:

$ pgprof -i my.prof

This simple application of pgprof can get you started profiling your application. You should profile your application every so often to understand how the routines stack rank relative to one another. Although it is likely to change, it can tell you what routines to target for OpenACC.

Starting the Port – Loops

When adding OpenACC directives to an application, I've found that loops are one of the first places to start. The basics of using OpenACC directives for loops were covered in the first OpenACC article in this series. Begin with the routine that uses the most time and look for loops in that routine that offer parallelism.

Table 1 is a reminder of how to parallelize loops. If you are using Fortran, no directive is needed to mark the end of a loop; the end of the loop itself closes the region (which makes life a little easier). In C, the open and close curly braces, { }, that delimit the loop body serve the same purpose.

Table 1: parallel loop Syntax for Fortran and C

Fortran:

!$acc parallel loop
   do i=1,n
      ...
   end do

C:

#pragma acc parallel loop
for (i=0; i < n; i++) {
   ...
}

Ideally, look for fairly long loops for greater amounts of parallelism to improve overall performance. You don’t want to parallelize loops that only repeat three times, for example. This amount of parallelism is extremely tiny, and you might find that parallelizing small loops actually makes your code run slower. If you have nested loops, and the innermost loop is not small, start with that loop: Put in the directive and compile.

I like to start my OpenACC porting by using all of the CPU cores before moving on to the GPU. This method allows me to get a good idea of how much performance improvement the parallel loop directives can provide. For example, if the code runs on one core and the CPU has four cores, then ideally, the code will run four times faster with OpenACC directives. Because I’m focusing on just the loops, I’m easily able to determine how well the directives work.

Porting to CPUs also saves me from having to worry about moving data between the CPU and a GPU or having to worry about memory, because there is really only one pool of memory – on the node with the CPUs. In this way I can eliminate any performance penalty from moving data, and I can focus on loops for parallelism.

The sample compile lines below for the PGI compiler build Fortran (first line) and C (second line) code for CPUs.

$ pgfortran -Minfo -Minfo=accel -acc -ta=multicore [file] -o ./exe
$ pgcc -Minfo -Minfo=accel -acc -ta=multicore [file] -o ./exe

Although the executable is named exe, you can name it anything you like.

Some of these compile options are redundant, but I like to wear a belt and suspenders when compiling to get as much information from the compiler as possible. The -Minfo option enables the compiler's informational messages, and -Minfo=accel limits them to the accelerator messages, so -Minfo alone would suffice, but I often use both. (I hope that doesn’t make a statement about my coding.)

The -acc option tells the compiler to use OpenACC. The -ta= (target accelerator) option tells the compiler which target to build the OpenACC code for. Again, you can use either, but I tend to use both out of habit.

Finally, I run the exe binary (./exe). By default, the binary will try to use all of the non-hyperthreaded cores on your system. You can control the number of cores used by setting the environment variable ACC_NUM_CORES. (See the document OpenACC for Multicore CPUs by PGI compiler engineer Michael Wolfe.)
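
For example, to restrict the multicore run to four cores (assuming a Bash-like shell):

$ export ACC_NUM_CORES=4
$ ./exe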

After your code runs, you need to check that the output is correct. There is no sense in making the code run faster if the answers are wrong. You should be computing the amount of time the application takes to run, which you can do at the command line if you don’t want your code to take on that task.
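
For example, the shell’s time builtin reports the whole-application wall clock time without touching the code:

$ time ./exe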

You can also use the pgprof profiler to get an idea of how long the application takes to run and how long it spends in routines. Of course, you don't have to do this immediately after your first parallel loop, but at some point you will want to profile your parallel code for comparison with your un-parallelized code.

Nested Loop Example

Many applications have nested loops that lend themselves to parallelizing with directives. You can parallelize each loop individually (Table 2), giving you more control, or you can use a single directive on the outer loop (Table 3). In the latter case, the directive tells the compiler to parallelize the outer loop, and it will parallelize the inner loop as best it can.

Table 2: Simple Nested Loops

Fortran:

!$acc parallel loop
   do i=1,n
      ...
!$acc parallel loop
      do j=1,m
         ...
      enddo
      ...
   end do

C:

#pragma acc parallel loop
for (i=0; i < n; i++) {
   ...
   #pragma acc parallel loop
   for (j=0; j < m; j++) {
      ...
   }
   ...
}

Table 3: Single-Directive Nested Loop

Fortran:

!$acc parallel loop
   do i=1,n
      ...
      do j=1,m
         ...
      enddo
      ...
   end do

C:

#pragma acc parallel loop
for (i=0; i < n; i++) {
   ...
   for (j=0; j < m; j++) {
      ...
   }
   ...
}

Another technique for gaining more parallelism (and more performance) with OpenACC is to tell the compiler to collapse two loops into one giant loop (Table 4). The collapse(2) clause tells the compiler to collapse the next two loops into one. The compiler then takes the i and j loops and merges them into one larger loop that is parallelized. This method allows you to take smaller loops and collapse them into a larger loop that is possibly a better candidate for parallelism. Note that the collapsed loops must be tightly nested, with no statements between them. The number of loops you collapse can vary, but collapsing a single loop is rather pointless; you should collapse two or more loops.

Table 4: Collapsing Loops

Fortran:

!$acc parallel loop collapse(2)
   do i=1,n
      do j=1,m
         ...
      enddo
   end do

C:

#pragma acc parallel loop collapse(2)
for (i=0; i < n; i++) {
   for (j=0; j < m; j++) {
      ...
   }
}

If the loops cannot be collapsed for whatever reason, the compiler should tell you so, and it also should give you a reason why it cannot collapse them.

On the CPU, try to parallelize as many loops as possible. Use the stack rank of the routines that use the most time and parallelize those loops first; then, just move down the stack. If loops are too small to parallelize, parallelizing them could make the code run slower, so don’t worry if you don’t parallelize every loop in your code.

GPUs

If you are happy with your loop-parallelizing directives on the CPU, it’s time to move on to GPUs. The first thing to do is recompile the code to target GPUs (Fortran and C, respectively):

$ pgfortran -Minfo -Minfo=accel -acc -ta=tesla [file] -o ./exe
$ pgcc -Minfo -Minfo=accel -acc -ta=tesla [file] -o ./exe

The only thing you need to change in the compilation command is the target architecture, from multicore to tesla, which allows the compiler to select the best compute capability to match the GPU you are using. You can also specify the compute capability if you like; just be sure to check where you are building the code and where you are running it, because the two systems might have different compute capabilities, and the code may or may not run.
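
For example, to specify the compute capability explicitly with the PGI compiler (cc70 is just an illustration; use the value that matches your GPU):

$ pgfortran -Minfo -Minfo=accel -acc -ta=tesla:cc70 [file] -o ./exe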

The PGI compiler is very good about providing information as it compiles for the GPU when you use the -Minfo or -Minfo=accel option. The output includes line numbers and comments about the compiler's actions.

The implicit directives are chosen by the compiler, and the compiler does a very good job of telling you what it did. Walking through the code while examining the compiler output is a great way to learn where the compiler expects data to be on the GPU and whether the data is only copied to the GPU or is also modified there. This feedback is invaluable.

After you compile the code, simply run it as before. You can monitor GPU usage with the nvidia-smi command. If you run this command in a loop, you can watch the GPU usage as the code runs. If the code runs quickly or if not much of it is on the GPU, you might not see GPU usage jump; in that case, just look at the total wall clock time.
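
For example, the following command refreshes the nvidia-smi output every second in a second terminal while the application runs:

$ watch -n 1 nvidia-smi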

Now that you are using the GPU, you have to worry about data movement. Possibly, your code will run slower than on a single-core CPU. Don't panic (make sure you have your towel). This means you will have to work more on the movement of data to reduce the total wall clock time.

If you want to profile the application on GPUs, you can do it the same way you did on CPUs; only the output differs. You will see a section on the GPU routines (kernels), sometimes labeled GPU activities, followed by a section on CUDA API calls, then a section on the routines accelerated by OpenACC, with numbers on the percentage of run time spent in the various OpenACC-accelerated routines. If your code uses any OpenMP calls (you can combine OpenMP and OpenACC), you will see them in the next section. Finally, the last section is the CPU profile, which contains the same information as when profiling CPUs only.

Overall, profiling just adds the GPU and OpenACC information in front of the CPU profiling information. The format of the information is very much the same as before and should be straightforward to read.

It might not seem like it, but memory movement between the CPU and the GPU can have a very large effect on code performance. Compilers are good about determining when they need data on the GPU for a specific kernel (section of code on the GPU), but they focus just on that kernel. The code writer knows much more about the code and can make decisions about whether to move data to or from the GPU or leave it with the GPU for future kernels.

I tend to use a fairly simple approach to data movement directives. With the PGI compiler, I use the compiler feedback to tell me where it believes data movement is needed. Listing 1 is a snippet of compiler output from an example code compiled with only loop directives in place. Looking through the output, you will see phrases such as Generating implicit .... The compiler also tells you what type of directive it is using. These are great tips for adding directives to your code.

Listing 1: PGI Compiler Output

main:
     43, Loop not vectorized/parallelized: contains call
     50, Memory set idiom, loop replaced by call to __c_mset8
     54, FMA (fused multiply-add) instruction(s) generated
     60, Loop not vectorized/parallelized: contains call
lbm_block:
    100, FMA (fused multiply-add) instruction(s) generated
    108, Generating implicit copyout(f_new(1:lx,1:ly,1:9),u_x(1:lx,1:ly),u_y(1:lx,1:ly),rho(1:lx,1:ly))
    109, Loop is parallelizable
    110, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        109, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        110,   ! blockidx%x threadidx%x collapsed
bc:
    153, Accelerator kernel generated
         Generating Tesla code
        154, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        161, !$acc loop seq
    153, Generating implicit copyin(f(:,:,:),block(:,:),vec_1(:))
         Generating implicit copy(f_new(2,2:ly-1,:))
         Generating implicit copyin(e(:))
    161, Complex loop carried dependence of f_new prevents parallelization
         Loop carried dependence of f_new prevents parallelization
         Loop carried backward dependence of f_new prevents vectorization
    178, Accelerator kernel generated
         Generating Tesla code
        179, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        185, !$acc loop seq
    178, Generating implicit copyin(f(:,:,:),block(:,:),vec_2(:))
         Generating implicit copy(f_new(lx-1,2:ly-1,:))
         Generating implicit copyin(e(:))
    185, Complex loop carried dependence of f_new prevents parallelization
         Loop carried dependence of f_new prevents parallelization
         Loop carried backward dependence of f_new prevents vectorization
...

Choose one of the kernels the compiler has indicated needs data on the GPU, then examine the routine for OpenACC parallel loop directives and determine what data is needed by that routine. Then, you can determine whether the data that needs to be moved to the GPU is modified in that kernel.

If the data is modified, you can use the data copy(A, B, ...) directive, which copies the data from the CPU to the GPU when the kernel is entered. When the kernel is exited, the data is copied from the GPU back to the CPU.

If the data is only needed on the GPU and isn’t modified, you can use the data copyin(A, B, ...) directive. When a GPU kernel is entered, the data is copied from the CPU to the GPU, but when the kernel is exited, the data is not copied back. Because the data is not modified, it doesn’t need to be copied back.

You can also use the data copyout(A, B, ...) directive for data that is on the GPU and is copied back to the CPU. I don’t tend to use this directive and just use data copy and data copyin.
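
For completeness, here is a minimal sketch of data copyout, in the same style as the tables below, for an array that is only produced on the GPU (the array A and the loop bounds are placeholders):

#pragma acc data copyout(A)
{
   #pragma acc parallel loop
   for (i=0; i < n; i++) {
      A[i] = 2.0 * i;   /* A is only written on the GPU, so it only needs to be copied out */
   }
}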

Every time you move data, you should recompile the code and check the output to make sure it is the same as the non-GPU code. Another good idea is to look at the compiler output whenever a new data directive is added for information about the implicit data movement directives that the compiler inserts into the code. Ideally, after you insert a data directive, the compiler should not insert an implicit one.

After the first routine, move to a neighboring routine, perhaps the one called right after it. Again, look at the compiler output for data directives that were implicitly inserted into the code to determine what data needs to be on the GPU. Compare this data to that from the previous routine. If you see some data in common, you just found a place where you can eliminate a data copy.

In this case, you can copy the common data over before the first routine and copy it back after the second routine. Be sure to check whether you need the data copy or data copyin directive. Also, be sure to check the output for accuracy.

Table 5 shows a simple example of the base code that includes parallel loop directives. From the code, you know that some of the data is needed on the GPU. You can also get this information from the compiler by looking for Generating implicit phrases in its output.

Table 5: Basic Parallel Loops

Fortran:

!$acc parallel loop
   do i=1,n
      A(i) = B(i) + C(i)
   enddo

!$acc parallel loop
   do j=1,m
      B(j) = B(j) * D(j)
   enddo

C:

#pragma acc parallel loop
for (i=0; i < n; i++) {
   A[i] = B[i] + C[i];
}

#pragma acc parallel loop
for (j=0; j < m; j++) {
   B[j] = B[j] * D[j];
}

Because this is the first time through this code, I’ll be a little naive and just use the data copy directive to copy the data from the CPU to the GPU and back (Table 6). Notice that each kernel parallel loop directive has its own data copy directive.

Table 6: Code with data copy

Fortran:

!$acc data copy(A, B, C)
!$acc parallel loop
   do i=1,n
      A(i) = B(i) + C(i)
   enddo
!$acc end data

!$acc data copy(B, D)
!$acc parallel loop
   do j=1,m
      B(j) = B(j) * D(j)
   enddo
!$acc end data

C:

#pragma acc data copy(A, B, C)
{
   #pragma acc parallel loop
   for (i=0; i < n; i++) {
      A[i] = B[i] + C[i];
   }
}

#pragma acc data copy(B, D)
{
   #pragma acc parallel loop
   for (j=0; j < m; j++) {
      B[j] = B[j] * D[j];
   }
}

By carefully examining the code, you can see that it copies array B back and forth twice, which is wasteful data movement. You can use directives to copy that array only once.

Moreover, if you look through the code, you will notice that some of the data arrays are needed only as input and are not modified within the accelerated code. These can be copied to the GPU with the data copyin directive, saving a copy for the data that doesn't need to be copied back to the CPU (Table 7).

Table 7: Saving Data Movement

Fortran:

!$acc data copy(A, B) copyin(C, D)
!$acc parallel loop
   do i=1,n
      A(i) = B(i) + C(i)
   enddo

!$acc parallel loop
   do j=1,m
      B(j) = B(j) * D(j)
   enddo
!$acc end data

C:

#pragma acc data copy(A, B) copyin(C, D)
{
   #pragma acc parallel loop
   for (i=0; i < n; i++) {
      A[i] = B[i] + C[i];
   }

   #pragma acc parallel loop
   for (j=0; j < m; j++) {
      B[j] = B[j] * D[j];
   }
}

Notice that just one data movement directive covers both kernels (both loops) and that the data copyin directive saves the data movement that occurred with the data copy directive.

Summary

Now that the HPC world is moving toward heterogeneous computing, taking advantage of the capability of various computing elements is key to getting better performance. OpenACC uses directives to parallelize your code and improve performance. In this article, I showed you how to port your code with parallel loop directives and directives for copying data between the CPU and GPU. If you’re starting to use OpenACC, the approach is worth trying a few times until you get the hang of porting.