Parallelizing Code – Loops

OpenACC is a great tool for parallelizing applications for a variety of processors. In this article, I look at one of the most powerful directives, loop.

In the last half of 2018, I wrote about critical HPC admin tools. Often, HPC admins become programming consultants by helping researchers get started with applications, debug the applications, and improve performance. In addition to administering the system, then, they have to know good programming techniques and what tools to use.

MPI+X

The world is moving toward Exascale computing (at least 10^18 floating point operations per second [FLOPS]) at a rapid pace. Even though most systems aren’t Exascale, quite a few are at least Petascale (>10^15 FLOPS) and use a large number of nodes. Programming techniques are evolving to accommodate Petascale systems while getting ready for Exascale. Meanwhile, a key programming technique called MPI+X refers to using MPI in an application for data communication between nodes while using something else (the X) for application coding within the node.

The X can refer to any of several tools or languages, including the use of MPI across all nodes (i.e., MPI+MPI), which has been a prevalent programming technique for quite a while. Classically, each core assigned to an application is assigned an MPI rank and communicates over whatever network exists between the nodes. To adapt to larger and larger systems, data communication has adapted to use multiple levels of communication. MPI ranks within the same node can communicate directly without a network interface card (NIC). Ranks that are not on the same physical node communicate through a NIC. Networking techniques can take advantage of specific topologies to reduce latency, improve bandwidth, and improve scalability.
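
As a minimal sketch of the MPI+X idea (here with OpenMP as the X; the file and variable names are illustrative), the following C program uses MPI ranks for communication between nodes and OpenMP threads for parallelism within a node:

/* Minimal MPI+OpenMP (MPI+X) sketch: MPI ranks communicate between nodes,
 * while OpenMP threads share memory within a node.
 * Build with an MPI wrapper and an OpenMP flag, e.g., mpicc -fopenmp hybrid.c
 * (the exact flags depend on your compiler). */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, nranks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* The "X" part: threads parallelize work inside the node. */
    #pragma omp parallel
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}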

Directives

A popular choice for the X is a directive-based approach such as OpenMP or OpenACC, standards that were formed to define directives that are not specific to a machine, operating system, or vendor. Directives are also referred to as “pragmas” and instruct the compiler to perform certain code transformations before compiling the resulting code.

If the compiler doesn’t understand the directive (pragma), it will ignore it. This feature is important because it allows for a single codebase, reducing the likelihood of adding errors to the code. For example, you can place OpenMP directives in your serial code and still run the code in either serial mode or parallel mode, depending on your compiler setting. In C/C++ code, a pragma will look like #pragma token-string. For instance,

#pragma omp parallel for

might be all that’s needed to parallelize a simple for loop.
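
As a complete, if contrived, C sketch (file and variable names are illustrative), the pragma below parallelizes an element-wise multiply; built without OpenMP support, the pragma is ignored and the same loop runs serially:

/* Hypothetical example: one pragma parallelizes the multiply loop.
 * Build with, e.g., gcc -fopenmp mult.c; without the OpenMP flag the
 * pragma is ignored and the code runs serially. */
#include <stdio.h>

#define N 1000000

static double x[N], y[N], z[N];

int main(void)
{
    for (int i = 0; i < N; i++) {   /* initialize inputs (serial) */
        x[i] = (double)i;
        y[i] = 2.0;
    }

    #pragma omp parallel for        /* iterations are split across threads */
    for (int i = 0; i < N; i++)
        z[i] = x[i] * y[i];

    printf("z[N-1] = %f\n", z[N - 1]);
    return 0;
}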

In this article, I look at OpenACC, a directives-based approach to parallelizing code and improving code performance.

OpenACC

OpenACC was originally developed to add accelerator device support that was missing from OpenMP at the time. The design goals for OpenACC are a bit different from those of OpenMP. OpenACC takes a descriptive approach, using directives to describe the properties of the parallel region to the compiler, which then generates the best code it can for the hardware on which you plan to run.

The goal of OpenACC was to support a wide range of accelerators, including multicore CPUs. Currently it supports

  • POWER CPU
  • Sunway
  • x86 CPU
  • x86 Xeon Phi
  • Nvidia GPU
  • PEZY-SC

As with OpenMP, OpenACC allows you to use a single codebase, which can reduce errors from the introduction of new code. To compilers that don’t support OpenACC, the directives just look like comments. OpenACC uses parallel directives (regions that are parallelizable), data directives (data movements to/from the accelerator devices), and clauses. Fundamentally, OpenACC requires that the parallel loop be free of any data dependencies, which sometimes requires loops to be rewritten. When such a code refactoring is required, the resulting code often runs faster both with and without the directives.

OpenACC uses the concept of “gangs” or “workers” instead of “threads.” Gangs are similar to threads, but they operate completely independently of each other without the ability to synchronize, which explains the requirement that loops be free of data dependencies. This setup enables the loop to run in parallel on the smallest and the largest of parallel processors without change.

The run-time environment will select how that code is mapped to gangs on the target architecture. For example, on CPUs, the gangs are mapped to cores. For GPUs, the gangs are mapped to the GPU processors. OpenACC can also use multiple gangs or combinations of gangs and lower level parallelism (to be covered later).

Parallel Computing

Classically, applications were written to be executed serially. One computation was performed after another. But this approach doesn't take into account that some computations or regions can be computed simultaneously. Finding and parallelizing such regions of an application allows it to run faster and scale better than serial applications (see Amdahl’s Law).

Today's processors have multiple cores, and accelerators such as GPUs have thousands of lightweight cores that can be used, as well. At a simplistic level, parallelization breaks a problem into discrete parts that can be solved simultaneously. Each part is then executed on different processors but with some sort of coordination.

One likely place for parallelization to occur is in loops. In this simple Fortran loop

do i = 1,n
   z(i) = x(i) * y(i)
enddo

each value z(i) is not dependent on previous values of z(i). Therefore, all values of z(i) can be computed in any order or at the same time. If the upper limit of the loop, n, is large enough, some processing hardware can greatly speed up the computation.

What happens if z(i) depends on a previous value, as in the following loop?

do i = 2,n
   z(i) = z(i-1)*2
enddo

As written, you can’t generally parallelize the loop because of the data dependency: each iteration needs the result of the previous one. This kind of dependency is called a loop-carried dependence, and it prevents loop-level parallelism. However, for this particular example, you could rewrite the loop in terms of the unchanging value z(1) so that it can be parallelized:

do i = 2,n
   z(i) = z(1)*2**(i-1)
enddo

When the compiler tells you a loop cannot be parallelized, you, the programmer, will need to determine whether it can be refactored in such a way that it can be parallelized.

In other situations, you might have to pay attention to race conditions, mutual exclusion, and parallel slowdown. For the purposes of this article, I won’t cover these situations or conditions.

OpenACC Introduction

OpenACC has a number of directives that can be put into code that give the compiler information about how you would like to parallelize the code. The general form of the syntax is shown in Table 1, but before getting too deep into directives, I’ll look at what is happening in the code with the parallel directive.

Table 1: Parallelizing Fortran and C

Fortran:
!$acc <directive> [clause [, clause] ...]
   < code >

C:
#pragma acc <directive> [clause [, clause] ...]
   < code >

The parallel directive is a way to express parallelism in your code. Table 2 is a very simple example of getting started with directives in both Fortran and C.

Table 2: parallel Directive

Fortran:
!$acc parallel
   < parallel code >
!$acc end parallel

C:
#pragma acc parallel
{
   < parallel code >
}

The Fortran directives begin with the Fortran comment character (!), and the C directives are pragmas, so a compiler without OpenACC support will simply ignore them.

In Fortran, immediately after the comment character (!) is $acc, which informs the compiler that everything that follows is an OpenACC directive. In C, the directive begins with #pragma, informing the compiler that some sort of directive follows; the acc that comes next tells the compiler that everything after it is an OpenACC directive. Inserting directives is pretty straightforward. The only piece of advice is not to forget the !$ before acc in Fortran or the #pragma before acc in C.

One of the neat things about OpenACC is that directives can be added incrementally. You should always start with working serial code (with or without MPI); then, you can start adding OpenACC directives to explore parallelism. Along with working serial code, you absolutely need a way to verify whether the output from the application is correct. If the output is verified as correct, then you can annotate more of the code with directives.

Annotating code is a process. You may find occasions when adding a directive causes the code to slow down. Do not worry. As you continue studying and annotating your code, you will find the reasons for slowdowns and be able to correct them. Slowdowns happen frequently on GPU systems because of the need to move data between the CPU and the GPU, and they are usually resolved as you incrementally add more directives.

OpenACC parallel and loop Directives

A key to understanding OpenACC is understanding what happens when a parallel directive is encountered. This explanation will extend to other OpenACC clauses, as well.

When the compiler encounters a parallel directive, it generates one or more parallel gangs. In the case of OpenACC on a CPU-only system, a gang will have a single thread and, most likely, one thread per core (unless the processor supports simultaneous multithreading [SMT]). On a GPU, a gang is a collection of processing elements (threads) that can number into the hundreds. These gangs, unless told otherwise, will execute the code redundantly. That is, each gang executes the exact same code (Figure 1). Processing proceeds from the top down for each gang.

Figure 1: Parallel directives and gangs (from OpenACC.org).

Inside the parallel directive, if you have a loop as shown in Table 3, each gang will execute each loop redundantly (i.e., each gang will run the exact same code). These gangs are executed at the same time but independently of one another (Figure 2).

Table 3: Gang Execution

Fortran:
!$acc parallel
   do i=1,n
      ! do something
   enddo
!$acc end parallel

C:
#pragma acc parallel
{
   for (int i=0; i < n; i++)
   {
      // do something
   }
}

Figure 2: Gangs running loops (from OpenACC.org).

Having each gang compute the exact same thing is not an efficient use of resources. Adding a loop directive to the code (Table 4) tells the compiler that the loop code can be run in parallel across the gangs.

Table 4: loop Directive

Fortran:
!$acc parallel
   !$acc loop
   do i=1,n
      a(i) = 0.0
   enddo
!$acc end parallel

C:
#pragma acc parallel
{
   #pragma acc loop
   for (int i=0; i < n; i++)
   {
      a[i] = 0.0;
   }
}

When the compiler encounters the parallel directive, it gets ready to create parallel code; the loop directive then informs the compiler which loops to parallelize. In the above code, when the compiler encounters the loop directive, it creates parallelized code in which the loop iterations are split across the gangs, or threads, as evenly as possible.

Adding only two lines to the code tells the compiler that you have a parallel region and a loop that can be parallelized. That is all you have to do. At this point, the compiler determines how best to parallelize the loop given the target processor architecture. This is the power of OpenACC.

Figure 3: Gangs running parallelized loop code (from OpenACC.org).

Notice in Figure 3 that each gang runs a number of threads. All of this parallelization is created by the compiler using the directives in the code. A best practice is to combine parallel and loop in one directive covering the loops that are to be optimized.
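
For example, the Table 4 loop written with the combined directive in C might look like the following sketch (the function name and types are illustrative); the Fortran equivalent is !$acc parallel loop placed immediately before the do loop:

/* Combined directive: the parallel region and the loop in a single line,
 * equivalent to the separate parallel and loop directives in Table 4. */
void zero_array(int n, double *restrict a)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
    {
        a[i] = 0.0;
    }
}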

Because in OpenACC the compiler uses the directives to parallelize the code, it will have a difficult time parallelizing if (1) the loop can’t be parallelized or (2) you don’t give it enough information to make a decision about parallelizing a loop. The compiler implementing the OpenACC standard should err on the side of caution and not try to parallelize a loop if it detects problems or if it is not sure.

If you have code with a loop that the compiler cannot parallelize (e.g., Listing 1), you might see compiler output like:

437, Complex loop carried dependence of e_k prevents parallelization
Loop carried reuse of e_k prevents parallelization

Listing 1: Unparallelizable Code

do k=1,9
   e_k(1) = real(e((k-1)*2+1))
   e_k(2) = real(e((k-1)*2+2))
   DP = e_k(1)*U_x(i,j) + e_k(2)*U_y(i,j)
   f_eq = w(k)*rho(i,j)*(1.0+3.0*DP+(9.0/2.0)*(DP**2.0)-(3.0/2.0)*DPU)
   f(i,j,k)=f(i,j,k)-(1.0/tau)*(f(i,j,k)-f_eq)
enddo

Although it is a very small loop of seven lines, it illustrates data dependencies that can cause the compiler to refuse to parallelize the loop(s).
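
To illustrate the same kind of “loop carried reuse” with a simpler, made-up C example (this is not the code from Listing 1): a small temporary array declared outside the loop is rewritten by every iteration, so the compiler must assume a dependence between iterations; giving each iteration its own temporaries removes the dependence and lets the loop parallelize.

/* Hypothetical example of loop-carried reuse: the temporary t is shared
 * by all iterations, so the compiler cannot safely parallelize this loop. */
void weighted_sum_shared(int n, const double *x, const double *u,
                         const double *v, double *z)
{
    double t[2];
    for (int i = 0; i < n; i++)
    {
        t[0] = x[2*i];
        t[1] = x[2*i + 1];
        z[i] = t[0]*u[i] + t[1]*v[i];
    }
}

/* Refactored: each iteration uses its own local temporaries, so the
 * iterations are independent and the loop can be parallelized. */
void weighted_sum_private(int n, const double *restrict x,
                          const double *restrict u,
                          const double *restrict v, double *restrict z)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
    {
        double t0 = x[2*i];
        double t1 = x[2*i + 1];
        z[i] = t0*u[i] + t1*v[i];
    }
}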

OpenACC Programming Approach

In this article, I discussed only one directive, loop. You can affect performance a great deal if you look for loop parallelism in your code and start using this directive. There is a recommended approach to using OpenACC directives, including the loop directive.

Although I do not want to change your coding style, ideally, adding directives to code should be driven by profiling and tracing. Profiling determines the routines where most of the run time is spent, expressed as a simple table: the routine name and how much time was spent in that routine. Then you stack-rank the times, and the top routines are the initial primary focus. Tracing is a timeline examination of what is happening in the application. In the case of accelerators, this includes data movement to and from accelerators.

With the initial list of target routines in hand, you can start adding directives to your code. Generally, adding only one directive at a time is recommended. By incrementally adding directives, you can understand the effect each makes on run time. You add a directive, rebuild the code, run the code, test it to make sure the answers are correct, and then look at the effect on performance.

While adding loop directives to your code, if run time goes up, don’t despair. This can happen because of excess data movement between the CPU and the accelerator. (I will cover this topic in an upcoming article.) You need to focus on the accuracy of the output. If the output is not correct, then you might need to change the directive, change the code, or even drop the loop directive.

Summary

OpenACC directives allow you to take serial code and “port” it to multicore CPUs or accelerators such as GPUs. The directives appear to be comments unless the compiler understands them, which allows you to use one version of the code – reducing the chance of errors and keeping code size down – and still build it with your usual compiler. If a compiler understands OpenACC, then simply adding specific flags to your compile line will allow you to build and run with multiple CPU cores or accelerators.

Performance improvements are achieved by locating regions in your application that can be parallelized. A classic approach is to find loops that can be parallelized. This article tackled the parallel and loop OpenACC directives and clauses for both Fortran and C. A best practice is to combine parallel and loop in one directive (i.e., !$acc parallel loop) for a loop nest you want to parallelize; if the compiler fails to parallelize a loop within that nest, de-nest the troublesome loop and put a !$acc loop directive before it, which allows you to tune the directives for that specific loop, as sketched below.
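
A minimal C sketch of that pattern (the loop body and names are made up):

/* Hypothetical loop nest: a combined parallel loop directive on the outer
 * loop and a separate loop directive on the inner loop, which can then be
 * tuned or refactored on its own if the compiler reports a problem with it. */
void scale_matrix(int n, int m, double *restrict a, const double *restrict b)
{
    #pragma acc parallel loop
    for (int j = 0; j < n; j++)
    {
        #pragma acc loop
        for (int i = 0; i < m; i++)
        {
            a[j*m + i] = 2.0 * b[j*m + i];
        }
    }
}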

When using loop parallelization, it is best practice to focus on the routines that use the most run time. In this fashion, you’ll make the biggest dent in run time with the fewest directives. A simple profile of the application can provide you with a stack rank of the most time-consuming routines.

As you add directives to your application, be sure to check the output. It should match the output of the application when it isn’t built with directives. This is absolutely vital to improving your application. Running fast but not getting the correct output is worthless. Running on processors other than CPUs can produce slightly different output; therefore, you don’t want to do a bit-for-bit comparison but rather compare the output for significant differences. Defining what is “significant” is up to the user and the application, but it should not be taken lightly.
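
A simple sketch of such a check in C (the function name and tolerance handling are illustrative, not from any particular application):

#include <math.h>
#include <stdio.h>

/* Compare a reference (serial/CPU) result against the accelerated result
 * using a relative tolerance instead of a bit-for-bit match. The tolerance
 * is arbitrary here; choose one appropriate for your application. */
int outputs_match(int n, const double *ref, const double *test, double rel_tol)
{
    for (int i = 0; i < n; i++)
    {
        double scale = fabs(ref[i]) > 1.0e-30 ? fabs(ref[i]) : 1.0;
        if (fabs(ref[i] - test[i]) / scale > rel_tol)
        {
            printf("Significant difference at %d: %g vs %g\n",
                   i, ref[i], test[i]);
            return 0;   /* outputs differ significantly */
        }
    }
    return 1;           /* outputs agree within tolerance */
}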

OpenACC has a number of directives other than parallel and loop to help you port applications. However, these two directives alone allow you to start attacking “hot spots” in your code immediately to improve performance. Learning just two OpenACC directives in exchange for improved loop parallelization isn’t a bad trade.

In the next OpenACC article, I discuss data usage, focusing on how you can consider it in combination with parallel loops to get even better performance.