Parallelizing Code – Loops

OpenACC

OpenACC was originally developed to add accelerator device support that was missing from OpenMP at the time. The design goals for OpenACC are a bit different from those of OpenMP. OpenACC takes a descriptive approach: directives describe the properties of the parallel region to the compiler, which then generates the best code it can for the hardware on which you plan to run.

The goal of OpenACC was to support a wide range of accelerators, including multicore CPUs. Currently it supports

  • POWER CPU
  • Sunway
  • x86 CPU
  • x86 Xeon Phi
  • Nvidia GPU
  • PEZY-SC

As with OpenMP, OpenACC allows you to use a single codebase, which can reduce the errors that come with introducing new code for each target. To compilers without OpenACC support, the directives just look like comments. OpenACC uses parallel directives (marking regions that are parallelizable), data directives (describing data movement to and from the accelerator device), and clauses that modify both. Fundamentally, OpenACC requires that a parallel loop be free of data dependencies, which sometimes means loops have to be rewritten. When such a refactoring is required, the resulting code often runs faster both with and without the directives.
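
As a minimal sketch of how these directives fit together (the program name, array contents, and the simple addition are my own choices for illustration, not from the original), a complete Fortran example might look like this:

program vec_add
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real :: x(n), y(n), z(n)

   x = 1.0
   y = 2.0

   ! Data directive: copy x and y to the device, copy z back afterward
   !$acc data copyin(x, y) copyout(z)
   ! Parallel directive: the iterations of this loop are independent
   !$acc parallel loop
   do i = 1,n
      z(i) = x(i) + y(i)
   enddo
   !$acc end data

   print *, z(1), z(n)
end program vec_add

The copyin and copyout keywords are data clauses; if you omit the data directive entirely, an OpenACC compiler generates implicit data movement for the parallel region, but being explicit often avoids unnecessary transfers.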

OpenACC uses the concept of “gangs” or “workers” instead of “threads.” Gangs are similar to threads, but they operate completely independently of each other, without the ability to synchronize, which explains the requirement that loops be free of data dependencies. This design lets the same loop run in parallel, without change, on the smallest and the largest of parallel processors.

The run-time environment selects how the gangs are mapped onto the target architecture. For example, on CPUs, gangs are mapped to cores; on GPUs, they are mapped to the GPU's processing units. OpenACC can also use multiple gangs or combinations of gangs and lower-level parallelism (to be covered later).
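
As an example (a sketch only; the gang and vector clauses below are standard OpenACC, but whether they help depends on the target), you can name the levels of parallelism explicitly and still leave the actual mapping to the compiler and runtime:

! Distribute iterations across gangs, with vector parallelism
! inside each gang; the mapping to cores, thread blocks, or SIMD
! lanes is decided by the compiler for the chosen target
!$acc parallel loop gang vector
do i = 1,n
   z(i) = x(i) * y(i)
enddo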

Parallel Computing

Classically, applications were written to be executed serially: one computation was performed after another. This approach ignores the fact that some computations or regions of an application can be performed simultaneously. Finding and parallelizing such regions allows the application to run faster and scale better than its serial counterpart (see Amdahl’s Law).
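
As a quick refresher, Amdahl’s Law says that if a fraction p of a program's run time can be parallelized and the rest must stay serial, the speedup on N processors is at most 1/((1 - p) + p/N). Even with an unlimited number of processors, the speedup is capped at 1/(1 - p), so the serial portion ultimately limits how well the application scales.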

Today's processors have multiple cores, and accelerators such as GPUs add thousands of lightweight cores that can be used as well. At a simplistic level, parallelization breaks a problem into discrete parts that can be solved simultaneously. Each part is then executed on a different processor, with some form of coordination among them.

One likely place for parallelization to occur is in loops. In this simple Fortran loop

do i = 1,n
   z(i) = x(i) * y(i)
enddo

each value z(i) depends only on x(i) and y(i) and not on any other element of z. Therefore, all values of z(i) can be computed in any order or at the same time. If the upper limit of the loop, n, is large enough, parallel hardware can greatly speed up the computation.
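
To show what this looks like with OpenACC (a sketch, assuming an OpenACC-capable Fortran compiler), a single directive is enough:

! The iterations are independent, so they can be distributed
! across gangs on whatever device the code is compiled for
!$acc parallel loop
do i = 1,n
   z(i) = x(i) * y(i)
enddo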

What happens if z(i) depends on a previous value, as in the following loop?

do i = 2,n
   z(i) = z(i-1)*2
enddo

As written, this loop can’t generally be parallelized because of the data dependency between iterations: each z(i) needs the value of z(i-1) computed by the previous iteration. This is called a loop-carried dependency. However, for this particular example, you could rewrite the loop in a form that can be parallelized:

do i = 2,n
   z(i) = z(1)*2**(i-1)
enddo

When the compiler tells you a loop cannot be parallelized, you, the programmer, need to determine whether it can be refactored into a form the compiler can parallelize.
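
With the dependency removed, every iteration reads only z(1) and writes its own z(i), so the refactored loop can be decorated just like the earlier example (the directive here is my addition, not part of the original listing):

! No loop-carried dependency remains: z(1) is read by every
! iteration but never written inside the loop
!$acc parallel loop
do i = 2,n
   z(i) = z(1)*2**(i-1)
enddo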

In other situations, you might have to pay attention to race conditions, mutual exclusion, and parallel slowdown. For the purposes of this article, I won’t cover these situations or conditions.