OpenMP – Coding Habits and GPUs

GPU target

One of the newest and most exciting features of OpenMP is the target directive, which offloads execution and associated data from the CPU to the GPU (accelerator), also referred to as the target device (hence the directive name). The target device owns the data, so accesses by the CPU during the execution of the target region are forbidden.

The general model for target is a single host and one or more target devices (accelerators). The “device” is an implementation-defined logical execution unit. Classically, each accelerator has its own local data storage, often called the “device data environment.” Data used within the offload or target region may be implicitly or explicitly mapped to the device. Within the accelerated (targeted) region, all OpenMP directives are allowed, but only a subset will run well on GPUs.

The general execution mode is host-centric. To begin, the host creates the data environments on the device(s). The host then maps data to the device data environment, which moves the data to the device. Next, the host offloads OpenMP target regions to the target device; that is, the code is executed on the device. After execution, the host updates the data between the host and the device, which transfers data from the device back to the host. Finally, the host destroys the data environment on the device.
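As a rough sketch of that sequence (the array names and the computation are hypothetical, and the target directive and map clause used here are covered in the sections that follow), a single offload construct performs the whole sequence implicitly:

#define N 1000

void offload_example(double x[N], double y[N])
{
   /* The map clauses create the device data environment and copy x to
      the device (steps 1 and 2), the loop runs on the device (step 3),
      y is copied back to the host when the region ends (step 4), and
      the device copies are then destroyed (step 5). */
   #pragma omp target map(to:x[0:N]) map(from:y[0:N])
   for (int i = 0; i < N; i++)
      y[i] = 2.0 * x[i];
}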

I don’t have any experience using the target directive yet. Most of what I can suggest is based on other peoples’ work, but I think it’s important to realize that OpenMP is expanding to include accelerator support.

Directives for Executing on a Target Device

Two primary directives execute code on a target device:

  • omp target 
  • omp declare target 

The first directive is for structured blocks and comes with clauses. The second option is for function definitions or declarations. The first directive is what you typically use to define a region of code to be run on a device.

Using the directive omp target will cause the compiler to move the region of code to the GPU and implicitly map the data from the host to the offload device. This directive is the easiest to use to offload code to the GPU (device).
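As a minimal sketch (the helper function scale(), the arrays, and their sizes are my own hypothetical examples, not from the article), offloading can be as simple as placing the directive in front of a loop; declare target marks a function so that a device version of it is compiled as well:

#include <stdio.h>

#define N 1000

/* declare target asks the compiler to also build a device version of
   this (hypothetical) helper so it can be called inside a target region. */
#pragma omp declare target
double scale(double v)
{
   return 2.0 * v;
}
#pragma omp end declare target

int main(void)
{
   double a[N], b[N];

   for (int i = 0; i < N; i++)
      b[i] = (double)i;

   /* No map clauses: the compiler implicitly maps a and b to and from
      the device for the duration of the target region. */
   #pragma omp target
   for (int i = 0; i < N; i++)
      a[i] = scale(b[i]);

   printf("a[10] = %f\n", a[10]);
   return 0;
}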

Map Variables to a Target Device

Directives allow you to get the data from the host to the device – and back to the host, if needed. These directives explicitly move the data:

  • omp map([map-type:] list), where map-type := alloc | tofrom | to | from | release | delete
  • omp target data … (for a structured block)
  • omp target update 
  • omp declare target 

These directives are commonly used to move data explicitly back and forth between the host and the device for a region of code that executes on the device.

Pay particular attention to the map directive. The map-type := alloc allocates (creates) data storage on the device; that data is only used on the device. The map-type := tofrom means the data is copied from the host to the device at the beginning of the region and from the device to the host at the end of the region.

The map-type := to copies data from the host to the device, but no data is returned (i.e., if the data is changed, the updates are lost once the code region is exited). The map-type := from copies data that has been created and updated on the device (most likely with alloc) to the host.

The directive target data offloads data from the CPU to the GPU – but not code execution. The target device owns the data, so access by the CPU during the execution of the contained target regions is forbidden (i.e., the CPU cannot access the data on the GPU).

By default, if you don’t use map, the compiler will default to tofrom; that is, each variable will be copied from the host to the target device at the beginning of the target region. At the end of the target region, the data is copied back from the target device to the host.
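To see how these map types work together, here is a hedged sketch (the arrays, sizes, and the two-step computation are hypothetical) that uses target data to keep data resident on the device across two target regions, alloc for device-only scratch space, and target update to copy a result back before the data region ends:

#define N 1000

void compute(double a[N], double b[N])
{
   double tmp[N];   /* scratch data that only ever lives on the device */

   /* b is copied in (to), a is copied out when the region ends (from),
      and tmp is only allocated on the device (alloc). */
   #pragma omp target data map(to:b[0:N]) map(from:a[0:N]) map(alloc:tmp[0:N])
   {
      /* First offload region: fill the device-only scratch array. */
      #pragma omp target
      for (int i = 0; i < N; i++)
         tmp[i] = 2.0 * b[i];

      /* tmp stays on the device between the two regions; nothing is
         transferred here. */

      /* Second offload region: compute a from the scratch data. */
      #pragma omp target
      for (int i = 0; i < N; i++)
         a[i] = tmp[i] + b[i];

      /* If the host needs a before the data region ends, target update
         copies it back explicitly. */
      #pragma omp target update from(a[0:N])
   }
   /* At the closing brace, map(from:a) copies a back to the host and
      the device storage for a, b, and tmp is released. */
}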

Workshare for Acceleration

OpenMP uses worksharing concepts to process data on the CPU; the same concepts apply to accelerator offloads:

  • omp teams … (structured block)
  • omp distribute … (for loops)

These directives can be very useful for creating more parallelism for the compiler to address, which makes code scale and perform better.

The omp teams directive spawns one or more thread teams, each with the same number of threads. Recall that the omp parallel directive only creates a single thread team. This directive allows you to create multiple teams.

After the directive, execution continues on the master thread of each team, so further loop parallelization can be accomplished. However, be careful, because there is no synchronization between teams.

The omp distribute directive distributes the iterations of the next loop to the master threads of the teams. These iterations are distributed statically, with no guarantee about the order in which the teams will execute (the same as with a static schedule in OpenMP), including no guarantee that all teams will execute simultaneously. One important thing to note is that this directive alone does not generate parallelism/worksharing within the thread teams.

GPU Example

Listing 3 is a simple example to illustrate how to use the target construct to transfer control from the host to the target device. It also “maps” variables between the host and target device data environments. Notice how similar the directives are between C and Fortran.

Listing 3: target Construct

Fortran:

!$omp target teams map(to:b,c,d) map(from:a)
!$omp distribute parallel do
   do i=1,count
      a(i) = b(i) * c + d
   enddo
!$omp end distribute parallel do
!$omp end target teams

C:

#pragma omp target teams map(to:b,c,d) map(from:a)
   {
#pragma omp distribute parallel for
      for (i = 0; i < count; i++) {
         a[i] = b[i] * c + d;
      }
   }

In this example, the first directive defines the beginning of an offload region using the target directive. The directives/clauses teams and distribute are used to create more parallelism so that more of the threads on the GPU can be used.

The map directives tell the compiler how to copy the data between the host and the target device. In this case, the first map copies the variables b, c, and d from the host to the device. However, these variables are not copied back to the host, so any changes are lost (i.e., kind of like function input data). The second map only copies the array a[] from the target device to the host at the end of the region. Notice that the host will create array a[] on the device but not copy anything to it.

After this line of directives, you can use other OpenMP directives to tell the compiler what you want to do on the device. In this case, the distribute parallel do/for directive spreads the loop iterations across the teams and parallelizes them over the threads within each team.
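For comparison, the same loop can also be written with a single combined construct, which is a common idiom (this variant is my addition, not part of the original listing):

#pragma omp target teams distribute parallel for map(to:b,c,d) map(from:a)
   for (i = 0; i < count; i++) {
      a[i] = b[i] * c + d;
   }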

Further Reading

You can find some good tutorials and talks about OpenMP and GPUs online.

From these presentations, and others, a few comments stand out:

  • GPUs are not CPUs.
  • OpenMP for a GPU will not look like OpenMP for a CPU.
  • Aggressively collapse loops to increase available parallelism.
  • Use the target data directive and map clauses to reduce data movement between the CPU and GPU (both of these tips appear in the sketch below).
  • Use accelerated libraries whenever possible.

These comments illustrate that the authors of these talks have spent some time with GPUs and OpenMP.
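As a rough sketch of the collapse and target data tips above (the grid dimensions, number of steps, and update formula are hypothetical), nested loops can be collapsed into one large iteration space while the arrays stay resident on the device across repeated offloads:

#define NROWS 2048
#define NCOLS 2048
#define NSTEPS 10

void relax(double *a, double *b)
{
   /* Keep both arrays resident on the GPU for all of the time steps,
      so the only transfers are one copy in and one copy out. */
   #pragma omp target data map(tofrom:a[0:NROWS*NCOLS]) map(to:b[0:NROWS*NCOLS])
   {
      for (int step = 0; step < NSTEPS; step++) {
         /* collapse(2) merges the two loops into a single iteration
            space of NROWS*NCOLS iterations for the GPU to spread
            across teams and threads. */
         #pragma omp target teams distribute parallel for collapse(2)
         for (int i = 0; i < NROWS; i++)
            for (int j = 0; j < NCOLS; j++)
               a[i*NCOLS + j] += 0.25 * b[i*NCOLS + j];
      }
   }
}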

Putting It All Together

The goal of the three OpenMP articles in this series was to present some directives and clauses that you can use to start improving the performance and scalability of your code, particularly serial code. Now, you should be able to pull everything together, from profiling your application to determine which code to parallelize, to using the various directives and clauses presented in the series.

As discussed in this article, modern processors have become effective vector processors that apply a single set of instructions to multiple data elements at once (i.e., SIMD). Also, I touched on the somewhat new target directive that allows you to run OpenMP code on targeted offload devices, such as GPUs. Compilers are still evolving to use this directive effectively, but if possible, you should follow compiler development and start practicing how to use GPUs effectively with OpenMP.

In the discussion about GPUs and OpenMP, the omp teams and omp distribute directives can help the compiler add parallelism to code. Although these directives can help CPU-based hardware, they are almost mandatory for GPU target offloads because of the huge number of threads.