OpenMP – Coding Habits and GPUs

In this third and last article on OpenMP, we look at good OpenMP coding habits and present a short introduction to employing OpenMP with GPUs.

When first using a new programming tool or programming language, it’s always wise to develop good general habits. Everyone who codes with OpenMP directives develops their own habits – some good and some perhaps not so good. As this three-part OpenMP series finishes, I highlight best practices from the previous articles that can lead to good habits.

Enamored with new things, especially those that drive performance and scalability, I can’t resist throwing a couple more new directives and clauses into the mix. After covering these new directives and clauses, I will briefly discuss OpenMP and GPUs. This pairing is fairly recent, and compilers are still catching up to the newer OpenMP standards, but it is important for you to understand that you can run OpenMP code on targeted offload devices (e.g., GPUs).

Nested Loops Part 2

In the first OpenMP article, I briefly touched on nested loops from the perspective of ordering the loops to get the most parallelism possible. Remember that the directive omp do/for only parallelizes one loop: the loop right after the directive. To create even more parallelism in code, OpenMP has added a clause to the omp do/for directive.

The clause is of the form omp do/for collapse(N), where N is the number of loops in a nested loop that should be collapsed into one “iteration space.” The iteration space is then divided across the threads according to the schedule clause, if provided.

A simple example of using the collapse clause is the code snippet in Listing 1. Note that for this simple code, OpenMP will collapse the two loops together to create a larger iteration space; then, it can take that iteration space and run it across the threads as directed.

Listing 1: collapse Clause

Fortran:

!$omp do collapse(2)
do i=1,N
   do j=1,N
      A(i,j) = B(i,j) + C(i,j)
   enddo
enddo

C:

#pragma omp for collapse(2)
for ( i = 0; i < N; i++ ) {
  for ( j = 0; j < N; j++ ) {
    A[i][j] = B[i][j] + C[i][j];
  }
}

The collapse clause has some restrictions. One is that you cannot put any statements between the loops (i.e., no code between the i and j loops in the example). Reordering or refactoring code to keep the loops next to each other can pay off, especially when targeting the GPUs discussed later. Regardless, CPUs are reaching 64 cores per socket, allowing more scalability, so collapse can help on CPUs, as well.

SIMD

If you remember your college math classes, you will recall that a scalar is a single number, whereas a vector is a list or array of numbers. Many processors today, such as the Intel Xeon Scalable CPU, IBM Power CPU, some AMD CPUs, and newer ARM CPUs have the capability of operating on vectors of numbers, rather than just a single number, at the same time. Single instruction, multiple data (SIMD) is another way of saying “vector operations” or “data parallelism.”

Why do you care about vectorization? Fundamentally, it adds more parallelism to the code, which improves performance and scalability. Vectorization applies the same set of instructions to a “pipeline” of data, improving performance. Most likely the data is taken from the L1 or L2 cache, which is much faster than regular memory. Today’s cores and compilers can help you vectorize your code with auto-vectorization (so you don’t have to vectorize by hand).

Starting in OpenMP 4.0, the directive omp simd was introduced. It basically gives a hint to the compiler to look for places in which the next loop can take advantage of any vector processing support that the hardware provides. Much of this work is done by the compiler analyzing the code to create as much SIMD as possible for the cores. However, omp simd is not a magic directive that you can sprinkle on your code to make it run faster; you need to pay attention to some gotchas:

  • Application performance will be sensitive to
    • hardware vector length,
    • data being in-cache (cache misses can be a killer on performance), and
    • proper cache alignment.
  • Code vectorization results in changes in the order of operations within a loop.
  • However, program semantics must be maintained so that each result is independent of the previous results.
  • Therefore, the compiler will be conservative and produce correct results, even if that means forgoing vectorization.

If the compiler can't vectorize certain loops, you might have to rework the code to reduce array dependencies, which is beyond the scope of this article; however, you can find materials online that explain how to do this.
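
To give a flavor of what such a dependence looks like, here is a minimal C sketch (the array names, sizes, and values are just for illustration):

#include <stdio.h>

#define N 1000

int main(void)
{
    float a[N], b[N];

    for (int i = 0; i < N; i++) {   /* set up some test data */
        a[i] = (float)i;
        b[i] = 2.0f;
    }

    /* Loop-carried dependence: iteration i needs a[i-1] from the
     * previous iteration, so the compiler will not vectorize it. */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];

    /* No dependence: each iteration touches only its own elements,
     * so the simd hint is safe and the loop can be vectorized. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        b[i] = 2.0f * a[i];

    printf("a[%d] = %f, b[%d] = %f\n", N - 1, a[N - 1], N - 1, b[N - 1]);
    return 0;
}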

For completeness, the omp simd directives for Fortran and C are shown in Listing 2. 

Listing 2: omp simd Directive

Fortran:

!$omp simd
do i=1,N
   ...
enddo
!$omp end simd

C:

#pragma omp simd
for ( i = 0; i < N; i++ ) {
   ...
}

For Fortran, the end simd directive closes the region (as with other loop directives, it is optional: the end of the loop also ends the region). For C/C++, the end of the loop marks the end of the region. You can also combine directives and put omp simd on the same line as omp do/for.
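
For example, the worksharing and SIMD directives can be combined on a single line, as in this small C sketch of a saxpy-style loop (the function name and arguments are hypothetical):

/* Each thread gets a chunk of the iteration range (for), and the
 * compiler is asked to vectorize each thread's chunk (simd). */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}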

Since OpenMP 4.0, versions 4.5 and 5.0 of the standard have added clauses to omp simd that give the compiler more hints on how to vectorize the code. These clauses can get complicated, so I won’t discuss them here.

To determine whether the compiler is vectorizing code, look at the compiler output. First, be sure to use an option that produces “verbose” output from the compiler. Then, look for output that tells you whether the compiler was able to vectorize the loop or not (e.g., “OpenMP SIMD LOOP WAS VECTORIZED” or “loop was not vectorized: existence of vector dependence”). This feedback allows you to go back into the code to look for data dependencies or other issues that prevent vectorization.

Depending on the language you use and your code, vectorization (omp simd) can be awesome, but it can also create issues. A tutorial makes the following observations:

  • Incorrect results are possible
  • Poor performance is possible
  • Memory errors are possible

To reduce the probability of any of these happening, be sure to test the output of your code for correctness and check the run time of the code.

As a final note, the directive omp simd is relatively new, so you will need to use more recent versions of GCC, Clang, or commercial compilers. The compilers that come standard with various distributions probably won’t understand the directive. However, because the directive appears to be a comment, the compiler will just ignore it. Once you get access to a compiler that understands it, you’re just a recompile away with no code changes.

Fewest Parallel Regions

The first habit I’ve gotten into is using the fewest parallel regions possible. Just the creation and destruction of a parallel region can take tens of microseconds. Although that doesn’t sound like much, creating and destroying parallel regions thousands or tens of thousands of times can cause a measurable delay in code execution and can reduce the scalability of the code according to Amdahl’s Law. In effect, that overhead becomes “serial time” that cannot be reduced by using more processors.

To determine how to balance the number of times a parallel region is created and destroyed compared with how much time is spent in the parallel region, I found a somewhat vague rule of thumb that says: The code inside a parallel region should run “several” times longer than the time to create or destroy a parallel region. I’m not quite sure how much longer the execution time should be, but the longer, the better.
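
As a minimal C sketch of this habit (the function and array names are hypothetical), the following fragment creates the thread team once and workshares two loops inside the same parallel region, rather than opening and closing a region for each loop:

/* Hypothetical helper: the caller is assumed to allocate the arrays. */
void two_passes(int n, double *a, const double *b, const double *c, double *d)
{
    /* One parallel region: the team is created once, and both loops are
     * workshared with omp for. The implicit barrier at the end of the
     * first loop guarantees a[] is complete before the second loop reads
     * it. Writing two separate "parallel for" loops instead would create
     * and destroy the team twice. */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];

        #pragma omp for
        for (int i = 0; i < n; i++)
            d[i] = 2.0 * a[i];
    }
}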

Be Safe and Think Debugging

As you port your code to OpenMP, you should remember a few items about how data is handled. By default, any variable defined outside of a parallel region is shared within it, and any variable defined only inside the parallel region is private. It is very important to understand which data-sharing attribute – shared or private – each variable has.

Relying on the default behavior of OpenMP can be dangerous because you have to pay very close attention to what is shared and what is private. A possible best practice is to use the default(private) clause at first, which forces all variables in each thread to be private: No variable values are shared between threads.

Another option, which can also help with debugging, is to use default(none), which forces you to declare each variable in the OpenMP region as either private or shared. If you don’t, the compiler throws an error telling you the variable needs to be declared (i.e., the code won’t compile), which provides a great way to track down every variable used in the parallel region or worksharing construct and define each one.

The difference between the two clauses is that default(none) forces you to define every variable, including shared variables, whereas default(private) makes every variable private and does not allow shared variables. You have to decide which option works for your specific variables (i.e., know thy code).

The best practice that I use is to start with default(private) to make sure the threads are created correctly and that at least the initial portions of the parallel code are executed correctly. I don’t worry too much about the correctness of the output because I’m just checking that everything starts correctly. Then, I switch to default(none) and define all variables as either shared or private. I make sure the code compiles and the output is correct; then I start debugging and working on improving performance and scalability.
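
The following minimal C sketch shows the default(none) stage of that workflow; the variables are illustrative only. (Note that default(private) has traditionally been a Fortran-only option, whereas default(none) is available in both languages.)

#include <stdio.h>

#define N 100

int main(void)
{
    double a[N], sum = 0.0, scale = 2.0;

    for (int i = 0; i < N; i++)          /* set up some test data */
        a[i] = (double)i;

    /* default(none) forces an explicit decision for every variable used
     * in the region: a and scale are shared (read-only here), sum is a
     * reduction, and the loop index i is automatically private. Leave
     * one of them out and the compiler refuses to build the code. */
    #pragma omp parallel for default(none) shared(a, scale) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += scale * a[i];

    printf("sum = %f\n", sum);
    return 0;
}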

Best Practices for schedule

In the second OpenMP article of the series, the concept of a schedule clause was introduced that tells the compiler how to schedule the blocks of work across the threads. The form of the Fortran clause is:

!$omp do schedule(kind[, chunksize])

The option kind has five possible values:

  • static
  • dynamic
  • guided
  • auto
  • runtime

Each schedule has different properties, as discussed in that same article. The question is: Which one is “best” for running your code?

As a review, the second option for the schedule clause is the chunksize. If specified, the work is divided into chunks of that size. A quick review of the schedule options:

  • static: If chunksize is specified, the iteration space is divided into chunks, each of chunksize iterations, and the chunks are assigned cyclically to each thread in order (block cyclic schedule); otherwise, the iteration space is divided into approximately equal chunks, one per thread.
  • dynamic: This schedule divides the iteration space up into chunks of size chunksize and puts them into a task queue. The chunks are assigned to unoccupied threads on a first come, first served basis out of the task queue (default for chunksize is 1).
  • guided: The guided schedule is similar to dynamic, but the chunks start off large and get smaller exponentially. The size of the next chunk is proportional to the number of remaining iterations divided by the number of threads, where the chunksize specifies the minimum size of the chunks. The default chunksize is 1.
  • auto: The auto schedule lets the run time decide, on its own, the assignment of iterations to threads. For example, if the parallel loop is executed many times, the run time can evolve a good schedule with some good load balance characteristics and low overheads. However, creating a good schedule can take time.
  • runtime: This option defers the scheduling decision until run time and is defined by an environment variable (OMP_SCHEDULE) that allows you to vary the schedule simply by changing the environment variable. You cannot specify a chunksize for this schedule.

With a static schedule and the default chunksize, the loop iteration range is divided into equal-sized chunks. For example, if you have four threads, then thread 0 will get the first quarter, thread 1 the second quarter, thread 2 the third quarter, and thread 3 the fourth quarter of the iteration range. This approach tends to improve cache utilization because the processor will grab subsequent memory locations, usually enough to fill a cache line, providing much faster data access for data in subsequent iterations.

As you can probably discern, the static schedule works well when the threads have close to the same amount of work (uniform work) to perform, but when different iterations require differing lengths of time (i.e., the per-iteration workloads are non-uniform), the static schedule might not be ideal. Instead, one of the other schedules would be a better choice because they address non-uniform workloads and can often improve processing performance. For example, a dynamic schedule with a smaller chunksize will do a better job of load balancing when the iterations have widely varying run times.

The dynamic schedule, however, is not just a “free” change. While the application is running, the OpenMP run time has to make scheduling decisions, which can slow down the application. On the other hand, a static schedule makes no decisions while the loop runs, because the assignment of iterations to threads is fixed up front, but it might not be the most efficient scheduling of the iteration ranges.

I typically use the static schedule, which divides the loop iterations evenly across the threads; however, a number of people have developed best practices for choosing a schedule. By the way, the default schedule is implementation dependent, so it’s not a bad idea to specify the one you want explicitly.
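
As a sketch of how the clause is used in practice (the function, loop, and chunk size are illustrative, not a recommendation), consider a triangular loop in which iteration i does work proportional to i:

/* Iteration i sums i+1 elements, so the work per iteration grows with i. */
void row_sums(int n, const double *x, double *sum)
{
    /* schedule(static) would leave the highest-numbered thread with most
     * of the work; schedule(dynamic, 16) hands out chunks of 16 iterations
     * to whichever thread is free, at the cost of some run-time
     * bookkeeping. schedule(runtime) would defer the choice to the
     * OMP_SCHEDULE environment variable. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j <= i; j++)
            s += x[j];
        sum[i] = s;
    }
}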

Viktor Eijkhout of TACC (Texas Advanced Computing Center) has created OpenMP documentation with some schedule recommendations:

  • Using the dynamic schedule (#pragma omp for schedule(dynamic[,chunk])) can deliver good load balancing if the iterations take very different amounts of time to execute, although it does carry run-time overhead for managing the queue of iteration tasks.
  • Using the schedule(guided) clause gradually decreases the chunk size. The thinking here is that large chunks carry the least overhead, but smaller chunks are better for load balancing.
  • If you don’t want to decide on a schedule in your code, you can specify runtime and the schedule will then be chosen at run time by being read from the OMP_SCHEDULE environment variable. You can even just leave it to the run-time library by specifying the schedule as auto.

GPU target

One of the newest and most exciting features of OpenMP is the target directive, which offloads execution and associated data from the CPU to the GPU (accelerator), also referred to as the target device (hence the directive name). The target device owns the data, so accesses by the CPU during the execution of the target region are forbidden.

The general model for target is a single host and one or more target devices (accelerators). The “device” is an implementation-defined logical execution unit. Classically, each accelerator has its own local data storage, often called the “device data environment.” Data used within the offload or target region may be implicitly or explicitly mapped to the device. Within the accelerated (targeted) region, all OpenMP directives are allowed, but only a subset will run well on GPUs.

The general execution mode is host-centric. To begin, the host creates the data environments on the device(s). The host then maps data to the device data environment, moving the data to the device. Next, the host offloads OpenMP target regions to the target device; that is, the code is executed on the device. After execution, the host updates the data between the host and the device, transferring data from the device back to the host. Finally, the host destroys the data environment on the device.

I don’t have any experience using the target directive yet. Most of what I can suggest is based on other peoples’ work, but I think it’s important to realize that OpenMP is expanding to include accelerator support.

Directives for Executing on a Target Device

Two primary directives execute code on a target device:

  • omp target 
  • omp declare target 

The first directive applies to a structured block and accepts clauses; the second applies to function definitions or declarations. The omp target directive is what you typically use to define a region of code to be run on a device.

Using the directive omp target will cause the compiler to move the region of code to the GPU and implicitly map the data from the host to the offload device. This directive is the easiest to use to offload code to the GPU (device).
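
A minimal C sketch of this implicit behavior might look like the following (the array size is arbitrary); because a and b are fixed-size arrays, they are implicitly mapped tofrom:

#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N], b[N];

    for (int i = 0; i < N; i++)     /* initialize on the host */
        b[i] = (double)i;

    /* The target construct moves execution of the block to the device.
     * With no map clauses, the fixed-size arrays a and b are mapped
     * tofrom implicitly: copied to the device on entry and back to the
     * host on exit. (Without an offload-capable compiler, the region
     * simply runs on the host.) */
    #pragma omp target
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}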

Map Variables to a Target Device

Directives and clauses allow you to get the data from the host to the device – and back to the host, if needed. These constructs explicitly move the data:

  • map([map-type:] list), where map-type := alloc | tofrom | to | from | release | delete
  • omp target data … (for a structured block)
  • omp target update 
  • omp declare target 

These constructs are commonly used to move data explicitly back and forth between the host and the device for regions of code that execute on the device.

Pay particular attention to the map clause. The alloc map type allocates (creates) data storage on the device, and that data is used only on the device. The tofrom map type means the data is copied from the host to the device at the beginning of the region and from the device to the host at the end of the region.

The to map type copies data from the host to the device, but no data is returned (i.e., if the data is changed on the device, the updates are lost once the code region is exited). The from map type copies data that has been created and updated on the device (most likely with alloc) back to the host.

The target data directive offloads data from the CPU to the GPU – but not code execution. The target device owns the data, so access by the CPU during the execution of the contained target regions is forbidden (i.e., the CPU cannot access the data on the GPU).

By default, if you don’t use map, the compiler will default to tofrom; that is, each variable will be copied from the host to the target device at the beginning of the target region. At the end of the target region, the data is copied back from the target device to the host.
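
Putting these pieces together, here is a minimal C sketch (array names and sizes are illustrative) that creates a device data environment with target data, keeps the data resident across two offload regions, and refreshes the host copy in between with target update:

#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N], b[N];

    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* target data creates the device data environment: b is copied to
     * the device once (to), and storage for a is created on the device
     * and copied back to the host at the end of the region (from). */
    #pragma omp target data map(to: b) map(from: a)
    {
        /* First offload region: uses the data already on the device. */
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        /* Refresh the host copy of a without tearing down the device
         * data environment. */
        #pragma omp target update from(a)
        printf("after first region: a[0] = %f\n", a[0]);

        /* Second offload region: reuses the same device copies. */
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b[i];
    }   /* a is copied back from the device to the host here */

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}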

Workshare for Acceleration

OpenMP uses worksharing constructs to spread work across threads on the CPU; the same idea applies to accelerator offload, with two additional directives:

  • omp teams … (structured block)
  • omp distribute … (for loops)

These directives can be very useful for creating more parallelism for the compiler to address, which makes code scale and perform better.

The omp teams directive spawns one or more thread teams, each with the same number of threads. Recall that the omp parallel directive only creates a single thread team. This directive allows you to create multiple teams.

After the directive, execution continues on the master thread of each team, so further loop parallelization can be accomplished. However, be careful, because there is no synchronization between teams.

The omp distribute directive distributes the iterations of the next loop to the master threads of the teams. The iterations are distributed statically, with no guarantee about the order in which the teams will execute them (the same as with the static schedule) and no guarantee that all teams will execute simultaneously. One important thing to note is that this directive, by itself, does not generate parallelism/worksharing within the thread teams.

GPU Example

Listing 3 is a simple example to illustrate how to use the target construct to transfer control from the host to the target device. It also “maps” variables between the host and target device data environments. Notice how similar the directives are between C and Fortran.

Listing 3: target Construct

Fortran:

!$omp target teams map(to:b,c,d) map(from:a)
!$omp distribute parallel do
do i=1,count
   a(i) = b(i) * c + d
enddo
!$omp end distribute parallel do
!$omp end target teams

C:

#pragma omp target teams map(to:b,c,d) map(from:a)
   {
#pragma omp distribute parallel for
      for (i = 0; i < count; i++) {
         a[i] = b[i] * c + d;
      }
   }

In this example, the first directive defines the beginning of an offload region with the target directive and creates a league of thread teams with teams; the map clauses on the same line control the data movement. The distribute and parallel do/for directives that follow are used to create more parallelism so that more of the threads on the GPU can be used.

The map clauses tell the compiler how to copy the data between the host and the target device. In this case, the first map clause copies the variables b, c, and d from the host to the device. These variables are not copied back to the host, so any changes to them on the device are lost (i.e., they act somewhat like function input data). The second map clause copies the array a only from the target device back to the host at the end of the region. Notice that the host allocates storage for a on the device but does not copy anything to it.

Inside the offload region, you can use further OpenMP directives to tell the compiler what you want to do on the device. In this case, the distribute directive spreads the loop iterations across the teams, and the parallel do/for directive parallelizes them across the threads within each team.

Further Reading

You can find some good tutorials and talks about OpenMP and GPUs online.

From these presentations, and others, a few comments stand out:

  • GPUs are not CPUs.
  • OpenMP for a GPU will not look like OpenMP for a CPU.
  • Aggressively collapse loops to increase available parallelism.
  • Use the target data directive and map clauses to reduce data movement between the CPU and GPU.
  • Use accelerated libraries whenever possible.

These comments illustrate that the authors of these talks have spent some time with GPUs and OpenMP.

Putting It All Together

The goal of the three OpenMP articles in this series was to present some directives and clauses that you can use to start improving the performance and scalability of your code, particularly serial code. Now, you should be able to pull everything together, from profiling your application to determine which code to parallelize, to using the various directives and clauses presented in the series.

As discussed in this article, modern processors have become effective vector processors that take a single set of instructions and apply them to a “pipeline” of data (i.e., SIMD). Also, I touched on the somewhat new directive target that allows you to run OpenMP code on targeted offload devices, such as GPUs. Compilers are still evolving to use this directive effectively, but if possible, you should follow compiler development and start practicing how to use GPUs effectively with OpenMP.

In the discussion about GPUs and OpenMP, the omp teams and omp distribute directives can help the compiler add parallelism to code. Although these directives can help CPU-based hardware, they are almost mandatory for GPU target offloads because of the huge number of threads.

Acknowledgment

I thank Professor Joel Adams from Calvin College (Grand Rapids, MI) for quickly finding bugs in the OpenACC and OpenMP article series and letting me know before I embarrassed myself too badly. Thanks, Joel!

The Author

I have been in the HPC business for almost 25 years (starting when I was four years old). You can find me lounging around a nearby Fry’s enjoying the coffee and waiting for sales.