Porting Code to OpenACC

Starting the Port – Loops

To port applications to OpenACC directives, I've found that one of the first places to start is with loops. The basics of using OpenACC directives for loops were covered in the first OpenACC article in this series. The first step is to take the routine that uses the most time and start looking for loops in that routine (looking for parallelism).

Table 1 is a reminder of how to parallelize loops. If you are using Fortran, no directive marks the end of a loop; the end of the loop itself closes the construct (which makes life a little easier). In C, the directive applies to the for loop that immediately follows it, and the loop body is delimited by curly braces, { }.

Table 1: parallel loop Syntax for Fortran and C

Fortran:

!$acc parallel loop
   do i=1,n
      ...
   end do

C:

#pragma acc parallel loop
for (i=0; i < n; i++) {
   ...
}

Ideally, look for fairly long loops for greater amounts of parallelism to improve overall performance. You don’t want to parallelize loops that only repeat three times, for example. This amount of parallelism is extremely tiny, and you might find that parallelizing small loops actually makes your code run slower. If you have nested loops, and the innermost loop is not small, start with that loop: Put in the directive and compile.

I like to start my OpenACC porting by using all of the CPU cores before moving on to the GPU. This method allows me to get a good idea of how much performance improvement the parallel loop directives can provide. For example, if the code runs on one core and the CPU has four cores, then ideally, the code will run four times faster with OpenACC directives. Because I’m focusing on just the loops, I’m easily able to determine how well the directives work.

Porting to CPUs also saves me from having to worry about moving data between the CPU and a GPU or having to worry about memory, because there is really only one pool of memory – on the node with the CPUs. In this way I can eliminate any performance penalty from moving data, and I can focus on loops for parallelism.

The sample compile lines below for the PGI compiler build Fortran (first line) and C (second line) code for CPUs.

$ pgfortran -Minfo -Minfo=accel -acc -ta=multicore [file] -o ./exe
$ pgcc -Minfo -Minfo=accel -acc -ta=multicore [file] -o ./exe

Although the executable is named exe, you can name it anything you like.

Some of these compile options are redundant, but I like to wear a belt and suspenders when compiling to get as much information from the compiler as possible. The -Minfo option produces detailed feedback from the compiler, and -Minfo=accel restricts that feedback to the accelerator directives, so you can use either one, but I often use both. (I hope that doesn’t make a statement about my coding.)

The -acc option tells the compiler to use OpenACC. The -ta= (target architecture) option tells the compiler which target to build for and also implies OpenACC. Again, you can use either one, but I tend to use both out of habit.

Finally, I run the exe binary (./exe). By default, the binary will try to use all of the physical (non-hyperthreaded) cores on your system. You can control the number of cores used by setting the environment variable ACC_NUM_CORES. (See the document OpenACC for Multicore CPUs by PGI compiler engineer Michael Wolfe.)
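
For example, to limit the run to four cores from a bash shell (four is just an illustration; use whatever core count fits your machine):

$ export ACC_NUM_CORES=4
$ ./exe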

After your code runs, you need to check that the output is correct. There is no sense in making the code run faster if the answers are wrong. You should be computing the amount of time the application takes to run, which you can do at the command line if you don’t want your code to take on that task.
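
If you don’t want to add timers to the code itself, the shell’s time command is a quick way to get the wall clock time:

$ time ./exe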

You can also use the pgprof profiler to get an idea of how long the application takes to run and how long it spends in routines. Of course, you don't have to do this immediately after your first parallel loop, but at some point you will want to profile your parallel code for comparison with your un-parallelized code.
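
A minimal profiling run looks something like the following (using the exe binary name from above); pgprof then prints a breakdown of where the run time was spent:

$ pgprof ./exe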

Nested Loop Example

Many applications have nested loops that lend themselves to parallelizing with directives. You can parallelize each loop individually (Table 2), giving you more control, or you can combine loop directives into a single directive (Table 3). In this case, the directives tell the compiler to parallelize the outer loop, and it will parallelize the inner loop as best it can.

Table 2: Simple Nested Loops

Fortran:

!$acc parallel loop
   do i=1,n
      ...
!$acc parallel loop
      do j=1,m
         ...
      enddo
      ...
   end do

C:

#pragma acc parallel loop
for (i=0; i < n; i++) {
   ...
#pragma acc parallel loop
   for (j=0; j < m; j++) {
      ...
   }
   ...
}

Table 3: Single-Directive Nested Loop

Fortran:

!$acc parallel loop
   do i=1,n
      ...
      do j=1,m
         ...
      enddo
      ...
   end do

C:

#pragma acc parallel loop
for (i=0; i < n; i++) {
   ...
   for (j=0; j < m; j++) {
      ...
   }
   ...
}

Another technique for gaining more parallelism (more performance) with OpenACC is to tell the compiler to collapse two loops to create one giant loop (Table 4). The collapse(2) clause tells the compiler to collapse the next two loops into one; the compiler then takes the i and j loops and merges them into one larger loop that is then parallelized. Note that the loops to be collapsed must be tightly nested (no statements between them). This method allows you to take smaller loops and collapse them into a larger loop that is possibly a better candidate for parallelism. The number of loops you collapse can vary, but collapsing a single loop is rather pointless; you should collapse two or more loops.

Table 4: Collapsing Loops

Fortran:

!$acc parallel loop collapse(2)
   do i=1,n
      do j=1,m
         ...
      enddo
   end do

C:

#pragma acc parallel loop collapse(2)
for (i=0; i < n; i++) {
   for (j=0; j < m; j++) {
      ...
   }
}

If the loops cannot be collapsed for whatever reason, the compiler should tell you this, and it also should give you a reason that it cannot collapse the loops.

On the CPU, try to parallelize as many loops as possible. Start with the routines that use the most time and parallelize their loops first; then, just move down the ranking. If loops are too small to parallelize, parallelizing them could make the code run slower, so don’t worry if you don’t parallelize every loop in your code.

GPUs

If you are happy with your loop-parallelizing directives on the CPU, it’s time to move on to GPUs. The first thing to do is recompile the code to target GPUs (Fortran and C, respectively):

$ pgfortran -Minfo -Minfo=accel -acc -ta=tesla [file] -o ./exe
$ pgcc -Minfo -Minfo=accel -acc -ta=tesla [file] -o ./exe

The only thing you need to change in the compilation command is the target architecture, from multicore to tesla, which allows the compiler to select the best compute capability to match the GPU you are using. You can also specify the compute capability if you like; just be sure to check where you are building the code and where you are running it, because the two systems might have different compute capabilities, and the code may or may not run.
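
For example, you can pin the build to a specific compute capability by adding it to the target option; cc70 (a Volta GPU) is just an illustration here, so match it to the GPU you will actually run on:

$ pgfortran -Minfo -Minfo=accel -acc -ta=tesla:cc70 [file] -o ./exe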

The PGI compiler is very good about providing information as it compiles for the GPU with the -Minfo or -Minfo=accel option. The output has line numbers and comments about the compiler’s actions.

The implicit directives are chosen by the compiler, and the compiler does a very good job of telling you what it did. Walking through the code while examining the compiler output is a great way to learn where the compiler expects data to be on the GPU and whether the data is only copied to the GPU or is copied and then modified. This feedback is invaluable.

After you compile the code, simply run it as before. You can monitor GPU usage with the nvidia-smi command; if you run the command in a loop, you can watch GPU usage change as the code runs. If the code runs quickly or if not much of it executes on the GPU, you might not see GPU usage jump up. In that case, just look at the total wall clock time.
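
One simple way to do that is with the standard watch utility, which reruns nvidia-smi every second while your code executes in another terminal:

$ watch -n 1 nvidia-smi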

Now that you are using the GPU, you have to worry about data movement. Possibly, your code will run slower than on a single-core CPU. Don't panic (make sure you have your towel). This means you will have to work more on the movement of data to reduce the total wall clock time.

If you want to profile the application on GPUs, you can do it the same way you did on CPUs only, but the output will be different. It starts with a section on the GPU routines (kernels), sometimes labeled GPU activities, followed by a section on CUDA API calls and then a section on the routines accelerated by OpenACC, with the percentage of run time spent in each OpenACC-accelerated routine. If your code uses any OpenMP calls (you can combine OpenMP and OpenACC), they appear in the next section. Finally, the last section is the CPU profile, which contains the same information as when profiling CPUs only.

Overall, profiling just adds the GPU and OpenACC information in front of the CPU profiling information. The format of the information is very much the same as before and should be straightforward to read.

It might not seem like it, but memory movement between the CPU and the GPU can have a very large effect on code performance. Compilers are good about determining when they need data on the GPU for a specific kernel (section of code on the GPU), but they focus just on that kernel. The code writer knows much more about the code and can make decisions about whether to move data to or from the GPU or leave it with the GPU for future kernels.

I tend to use a fairly simple approach to data movement directives. With the PGI compiler, I use the compiler feedback to tell me where it believes data movement directives are needed. Listing 1 is a snippet of compiler output from an example code that contains only loop directives. Looking through the output, you will see phrases such as Generating implicit .... The compiler also tells you what type of directive it is using. These are great tips for adding directives to your code.

Listing 1: PGI Compiler Output

main:
     43, Loop not vectorized/parallelized: contains call
     50, Memory set idiom, loop replaced by call to __c_mset8
     54, FMA (fused multiply-add) instruction(s) generated
     60, Loop not vectorized/parallelized: contains call
lbm_block:
    100, FMA (fused multiply-add) instruction(s) generated
    108, Generating implicit copyout(f_new(1:lx,1:ly,1:9),u_x(1:lx,1:ly),u_y(1:lx,1:ly),rho(1:lx,1:ly))
    109, Loop is parallelizable
    110, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        109, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        110,   ! blockidx%x threadidx%x collapsed
bc:
    153, Accelerator kernel generated
         Generating Tesla code
        154, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        161, !$acc loop seq
    153, Generating implicit copyin(f(:,:,:),block(:,:),vec_1(:))
         Generating implicit copy(f_new(2,2:ly-1,:))
         Generating implicit copyin(e(:))
    161, Complex loop carried dependence of f_new prevents parallelization
         Loop carried dependence of f_new prevents parallelization
         Loop carried backward dependence of f_new prevents vectorization
    178, Accelerator kernel generated
         Generating Tesla code
        179, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        185, !$acc loop seq
    178, Generating implicit copyin(f(:,:,:),block(:,:),vec_2(:))
         Generating implicit copy(f_new(lx-1,2:ly-1,:))
         Generating implicit copyin(e(:))
    185, Complex loop carried dependence of f_new prevents parallelization
         Loop carried dependence of f_new prevents parallelization
         Loop carried backward dependence of f_new prevents vectorization
...

Choose one of the kernels the compiler has indicated needs data on the GPU, then examine the routine for OpenACC parallel loop directives and determine what data is needed by that routine. Then, you can determine whether the data that needs to be moved to the GPU is modified in that kernel.

If the data is modified, you can use the data copy(A, B, ...) directive, which copies the data from the CPU to the GPU when the kernel is entered. When the kernel is exited, the data is copied from the GPU back to the CPU.

If the data is only needed on the GPU and isn’t modified, you can use the data copyin(A, B, ...) directive. When a GPU kernel is entered, the data is copied from the CPU to the GPU, but when the kernel is exited, the data is not copied back. Because the data is not modified, it doesn’t need to be copied back.

You can also use the data copyout(A, B, ...) directive for data that is produced only on the GPU: the arrays are allocated on the GPU without an initial copy from the CPU, and the results are copied back to the CPU when the region is exited. I don’t tend to use this directive and just use data copy and data copyin.
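
As a minimal sketch of how these clauses differ (the array names A and B, the size n, and the index i are hypothetical and assumed to be declared elsewhere, not taken from the example code): B is only read on the GPU, so copyin is enough, and A is only written, so copyout avoids the initial copy to the GPU.

#pragma acc data copyin(B[0:n]) copyout(A[0:n])
{
   #pragma acc parallel loop
   for (i = 0; i < n; i++) {
      A[i] = 2.0 * B[i];   /* B comes in with copyin; A goes back with copyout */
   }
}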

Every time you move data, you should recompile the code and check the output to make sure it is the same as the non-GPU code. Another good idea when a new data directive is added is to look at the compiler output for information about the implicit data movement directives the compiler inserts into the code. Ideally, after you insert a data directive, the compiler should not insert an implicit data directive.

After the first routine, move on to a neighboring routine, perhaps the one called next. Again, look at the compiler output for data directives that were implicitly inserted into the code to determine what data needs to be on the GPU. Compare this data with that of the previous routine. If you see some data in common, you have just found a place where you can eliminate a data copy.

In this case, you can copy the common data over before the first routine and copy it back after the second routine. Be sure to check whether you need the data copy or data copyin directive. Also, be sure to check the output for accuracy.

Table 5 shows a simple example of the base code that includes parallel loop directives. From the code, you know that some of the data is needed on the GPU. You can also get this information from the compiler by looking for Generating implicit phrases in its output.

Table 5: Basic Parallel Loops

Fortran:

!$acc parallel loop
   do i=1,n
      A(i) = B(i) + C(i)
   enddo

!$acc parallel loop
   do j=1,m
      B(j) = B(j) * D(j)
   enddo

C:

#pragma acc parallel loop
for (i=0; i < n; i++) {
   A[i] = B[i] + C[i];
}

#pragma acc parallel loop
for (j=0; j < m; j++) {
   B[j] = B[j] * D[j];
}

Because this is the first time through this code, I’ll be a little naive and just use the data copy directive to copy the data from the CPU to the GPU and back (Table 6). Notice that each kernel parallel loop directive has its own data copy directive.

Table 6: Code with data copy

Fortran:

!$acc data copy(A, B, C)
!$acc parallel loop
   do i=1,n
      A(i) = B(i) + C(i)
   enddo
!$acc end data

!$acc data copy(B, D)
!$acc parallel loop
   do j=1,m
      B(j) = B(j) * D(j)
   enddo
!$acc end data

C:

#pragma acc data copy(A, B, C)
{
   #pragma acc parallel loop
   for (i=0; i < n; i++) {
      A[i] = B[i] + C[i];
   }
}

#pragma acc data copy(B, D)
{
   #pragma acc parallel loop
   for (j=0; j < m; j++) {
      B[j] = B[j] * D[j];
   }
}

By carefully examining the code, you can see that it copies array B back and forth twice, which is wasteful data movement. You can use directives to copy the array only once.

Moreover, if you look through the code, you will notice that some of the data arrays are needed only as input and are not modified within the accelerated code. These can be copied to the GPU with the data copyin directive, saving a copy for the data that doesn't need to be copied back to the CPU (Table 7).

Table 7: Saving Data Movement

Fortran:

!$acc data copy(A, B) copyin(C, D)
!$acc parallel loop
   do i=1,n
      A(i) = B(i) + C(i)
   enddo

!$acc parallel loop
   do j=1,m
      B(j) = B(j) * D(j)
   enddo
!$acc end data

C:

#pragma acc data copy(A, B) copyin(C, D)
{
   #pragma acc parallel loop
   for (i=0; i < n; i++) {
      A[i] = B[i] + C[i];
   }

   #pragma acc parallel loop
   for (j=0; j < m; j++) {
      B[j] = B[j] * D[j];
   }
}

Notice that just one data movement directive now covers both kernels (both loops) and that the copyin clause eliminates the unnecessary copies back to the CPU that the data copy directive made for the input-only arrays C and D.