In the Loop

Diving deeper into OpenMP loop directives for parallel code.

In the first article in this series, the focus was on creating teams of threads and parallelizing a loop across them with OpenMP directives. Just two directives got you started porting your serial code to OpenMP. In this article, I’m going to go a little deeper into those two directives and talk more about teams and loops, including some options (called “clauses”) for the parallel loop directive. These clauses lend more flexibility, including help with scheduling work for the threads in the team.

Revisiting Loops

The parallel directive defines a region of code to be parallelized and creates a team of threads to execute it; then, with the do/for directive, you can split a loop across the team. This construct is key to OpenMP parallelism, and a few directive clauses can be exploited to provide additional control and, possibly, additional performance.

Data and Control Parallelism

A quick review of the OpenMP do/for directive brings up some points that should be clarified. Using directives to take serial regions of code and run them in parallel across all of the cores (there are ways to restrict the number of cores used) is the way to improve performance.

In Figure 1, the threads in the team that operate on the data are shown in green and are being used for “data parallelism.” However, you don’t have to use every thread in the team; instead, you can take one or more threads and use them to perform different functions. These special threads generally can be thought of as providing “control parallelism.” A simple example is using one thread for I/O while the other threads are used for computation. These control threads are shown as blue, red, and yellow boxes in Figure 1. The combination of data and control parallelism results in a great deal of flexibility in porting your application to OpenMP.

Figure 1: Data and control parallelism in OpenMP.

One or two control threads allow you to overlap computation with I/O or other control-like functions, improving the overall parallelism of your application. Given that processors can have 64 cores or more, giving up one or two of them so your code scales better is a great trade-off.

Although I won’t specifically discuss how to use data parallelism and control parallelism at the same time, as you port your code to OpenMP, look for opportunities where you might use a thread or two for a control function that would allow the remaining cores to be used for computation. Once you have your code ported to parallelize your loops, you can come back and revisit the use of control threads.
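To make this idea concrete, the following minimal C sketch dedicates one thread to I/O while the rest compute. It assumes an OpenMP compiler and at least two threads in the team; the functions write_results() and compute_chunk() are hypothetical placeholders, not part of any library.

#include <stdio.h>
#include <omp.h>
 
/* Hypothetical placeholders for this sketch. */
void write_results(void) { printf("control thread: writing results\n"); }
void compute_chunk(int worker, int nworkers) { (void)worker; (void)nworkers; /* part of the computation */ }
 
int main(void)
{
#pragma omp parallel
   {
      int id = omp_get_thread_num();
      int nthreads = omp_get_num_threads();
 
      if (id == 0)
         write_results();                     /* control parallelism: I/O */
      else
         compute_chunk(id - 1, nthreads - 1); /* data parallelism: compute */
   }   /* implicit barrier (join) */
   return 0;
}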

Teams and Loops

To improve the performance of your code, you want to parallelize loops as much as possible (i.e., split the work in the loops across as many threads as possible). For a single loop in the code, you can use the combined directive (!$omp parallel do or #pragma omp parallel for) to parallelize it. The example in Listing 1 shows sets of loops one after the other, and Listing 2 shows how to parallelize each loop with the !$omp parallel do or #pragma omp parallel for directive.

Listing 1: Sets of Loops

Fortran
do i=1,n
...
end do
 
do j=1,m
...
end do
 
do k=1,n*m
...
end do
 
C
for (i=0; i < n; i++)
{
   ...
}
 
for (j=0; j < m; j++)
{
   ...
}
 
for (k=0; k < n*m; k++)
{
   ...
}

Listing 2: Parallelized Sets of Loops

Fortran
!$omp parallel do
do i=1,n
...
end do
!$omp end parallel do
 
!$omp parallel do
do j=1,m
...
end do
!$omp end parallel do
 
!$omp parallel do
do k=1,n*m
...
end do
!$omp end parallel do
 
C
#pragma omp parallel for
   for (i=0; i < n; i++)
   {
      ...
   }

#pragma omp parallel for
   for (j=0; j < m; j++)
   {
      ...
   }
 
#pragma omp parallel for
   for (k=0; k < n*m; k++)
   {
      ...
   }

Between each loop directive, a team of threads is created (forked), and data is copied to each thread. After computation, the threads are synchronized and destroyed (joined). This process has to be done for each loop, adding computational time that is not parallelizable and reducing performance and scalability. Ideally, you want to spend as little time as possible in fork and join operations. For example, the code in Listing 3 illustrates how to reduce a series of loops to a single fork/join structure.

Listing 3: Single Fork and Join

Fortran
!$omp parallel
 
!$omp for
 do i=1,n
...
end do
!$omp end do
 
!$omp do
do j=1,m
...
end do
!$omp end do
 
!$omp do
do k=1,n*m
...
end do
!$omp end do
 
!$omp end parallel
 
C
#pragma omp parallel
{
 
   #pragma omp for   
      for (i=0; i < n; i++)
      {
         ...
      }   
 
   #pragma omp for   
      for (j=0; j < m; j++)
      {
         ...
      }
   
   #pragma omp for   
      for (k=0; k < n*m; k++)
      {
         ...
      }
   
}

Here, the parallel and do/for directives are separated from each other (they don’t have to be together), which allows a single fork/join operation to encompass all of the loops: just one team creation step (fork) and one team synchronization and thread destruction step (join). Fewer fork/join operations reduce run time and improve scalability.

Clauses with Reductions

When creating a team, you have several clauses (options) available for debugging or data control when the omp parallel directive is used. The form of the directive with clauses for Fortran and C, respectively, is

!$omp parallel <clauses>
#pragma omp parallel <clauses>

where <clauses> can be any one (or more) of the following:

  • copyin(<list>) allows all of the threads to access the master thread’s values of the variables in <list>.
  • copyprivate(<list>) specifies that the value of one or more of the variables in <list> should be copied (broadcast) from one thread to all of the other threads.
  • default(shared|none) allows you to specify the default behavior in the parallel region. If you specify shared, then any variable in a parallel region will be treated as if it were specified with the shared clause. The option none means that any variable used in a parallel region that isn’t specified with private, shared, reduction, firstprivate, or lastprivate will cause a compiler error.
  • firstprivate(<list>) says each thread should have its own copy of the variable(s) specified in <list>. In actuality, each thread gets a unique memory address in which to store values for that variable while it’s used in the parallel region. When the region ends, the memory is released and the variables no longer exist.
  • if(scalar_logical_expression) runs the region in parallel if the expression is true; otherwise, a single thread is used.
  • lastprivate causes the value of a specified variable from the last iteration (or section) to be used after the parallel region.
  • nowait overrides the barrier that is implicit at the end of a worksharing directive (e.g., do or for); the barrier at the end of a parallel region itself cannot be removed.
  • num_threads allows you to specify the number of threads in the team.
  • ordered is required if a parallel do or parallel for statement uses an ordered directive in the loop.
  • private(<list>) tells the compiler that each thread should have its own instance of a variable. You can specify a <list> of variables that are to be private.
  • reduction(operator:<list>) is an important clause for scientific code, specifying that one or more variables private to each thread are combined by a reduction operation at the end of the parallel region. Classic operators, including addition (+), subtraction (-), maximum (max), minimum (min), and multiplication (*), are available. Other operations (e.g., logical operators) also can be used.
  • schedule(<type>[,<size>]) applies to work schedules for the parallel do or parallel for loop directives.
  • shared(<list>) allows you to specify that one or more variables in <list> are to be shared among all of the threads.

As you can see, a number of clauses give you a great deal of freedom. Rather than focus on all of them, I will only discuss a few that I've found useful and a few that can be confusing.
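As a quick illustration of how clauses are combined on a single directive, here is a hedged C sketch (the array a and the trip count n are assumed to exist already): the loop runs in parallel on four threads, but only when n is large enough to justify the fork/join overhead.

/* Sketch only: a and n are assumed to be declared elsewhere. */
#pragma omp parallel for if(n > 10000) num_threads(4) shared(a, n)
   for (int i = 0; i < n; i++)
      a[i] = 2.0 * a[i];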

default

The default clause is probably one of the best debugging clauses, because it allows you to specify the behavior of unscoped variables in a parallel region. In particular, default(none) requires you to scope every single variable used in the parallel region as private or shared. If you introduce a new variable in the parallel region and don’t scope it, the compiler will give you an error, requiring you to define it in a directive clause. Compare this with the implicit none statement in Fortran.
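Here is a small hedged sketch of default(none) as a debugging aid; the names x, n, and scale are illustrative and assumed to be declared elsewhere, not taken from the listings in this article.

double sum = 0.0;
 
/* Every variable used inside the loop must be scoped explicitly. */
#pragma omp parallel for default(none) shared(x, n, scale) reduction(+:sum)
   for (int i = 0; i < n; i++)
      sum += scale * x[i];
 
/* If scale were omitted from the shared() list, the compiler would stop
   with an error instead of silently sharing the variable. */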

firstprivate and lastprivate

Although I really don’t use these two clauses in my coding, they can be confusing, so I’ll talk about them a bit. By definition, firstprivate means that each thread should have its own instance of a variable (memory location), initialized with the value the variable had right before the parallel region.

In practice, firstprivate means that each thread will start with a variable that has the same value (rather than starting as 0). The variable can be changed by each thread as needed. During the join process after the parallel region, the variable on the master thread returns to its value before the parallel region.

The lastprivate clause means that each thread gets its own private copy of the variable, which can be changed in the parallel region. During the join phase, the master thread sets the variable to the value from whichever thread executes the final iteration (think of a do/for loop) or the last section (I haven’t discussed sections thus far in OpenMP).

The firstprivate and lastprivate clauses are best illustrated with an example, which I base on code by Michael Lindon (Listing 4). If you run the code as written with the private() clause, you’ll see the output in Listing 5.

Listing 4: private() Example

Fortran
program main
   use omp_lib
 
   integer :: i
   integer :: x
 
   x = 44
 
!$omp parallel do private(x)
   do i=1,10
      x = i
      write(*,*) "Thread number: ", omp_get_thread_num(), "    x: ", x
   end do
!$omp end parallel do
 
   write(*,*) "x is ", x
end program main
 
C
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
 
int main(void){
   int i;
   int x;
   x=44;
 
#pragma omp parallel for private(x)
   for (i = 0; i <= 10; i++) {
      x = i;
      printf("Thread number: %d     x: %d\n", omp_get_thread_num(), x);
   }
   printf("x is %d\n", x);
}

Listing 5: private() Output

Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 3     x: 9
Thread number: 3     x: 10
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
x is 44

Notice that the value of x varies across the threads, but the final value after the parallel region is still 44. The final value was not changed by the threads because each thread worked on its own private copy of the variable.

Next, I’ll switch private() to firstprivate() (Listing 6). Recall that firstprivate specifies that each thread should have its own version of the variable (independent of all others), initialized with the value the variable had before the start of the parallel region. After the parallel region, the variable returns to its original value, 44.
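In the C version of Listing 4, the only change needed is the clause on the directive:

#pragma omp parallel for firstprivate(x)

Each thread’s private x now starts at 44 instead of being uninitialized, although in this example it is immediately overwritten by x = i.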

Listing 6: firstprivate() Output

Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 3     x: 9
Thread number: 3     x: 10
x is 44

Finally, I’ll switch firstprivate to lastprivate (Listing 7). The value of x after the parallel region (join) is the value from the last iteration – in this case, 10.
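Again, only the clause on the C directive changes:

#pragma omp parallel for lastprivate(x)

The thread that executes the sequentially last iteration (i = 10) copies its value of x back to the original variable after the loop.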

Listing 7: lastprivate() Output

Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 3     x: 9
Thread number: 3     x: 10
Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
x is 10

Be careful as you use shared, firstprivate, and lastprivate in your code. As illustrated in this example, you can get results you might not expect.

nowait

One important topic in parallelism is synchronization. As threads compute separately, they might at some point need to share information or all reach the same place in the code before proceeding. Synchronization gives you control over how the threads run their assigned work, balancing performance and correctness by forcing the threads to proceed in a certain order.

OpenMP places an implicit barrier at the end of each worksharing construct (such as a do/for loop) and at the end of a parallel region (the join), which provides synchronization because no thread proceeds until all threads have arrived. The nowait clause lets you remove the implicit barrier at the end of a worksharing construct and can be particularly effective if one parallel loop is followed by several more and the loops don’t depend on each other. However, you have to be very careful about how the threads share data, or you can get wrong answers.

For example, assume a thread has finished its share of a loop that carries the nowait clause. The thread doesn’t stop at the end of the do/for loop; it moves straight on to the next work in the parallel region. If that work needs data that other threads are still computing (e.g., for I/O), the data might not be available yet, and you will get incorrect answers. Generally, this is called a “race condition.”
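As a hedged sketch of such a race (the arrays a, b, and c and the length n are illustrative, not from the listings), the second loop below reads elements of b that another thread might not have written yet, so nowait on the first loop would be unsafe:

#pragma omp parallel
{
   #pragma omp for nowait           /* threads do NOT wait here ...        */
   for (int i = 0; i < n; i++)
      b[i] = a[i] * a[i];
 
   #pragma omp for                  /* ... so b[n-1-i] might not be ready  */
   for (int i = 0; i < n; i++)      /* when a fast thread arrives here:    */
      c[i] = b[i] + b[n-1-i];       /* a race condition                    */
}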

However, if you are absolutely sure you won't get a race condition, nowait allows the threads to run as fast as they can. An example taken from Microsoft documentation (Listing 8) illustrates how to do this. The output from the Fortran version of the code compiled with GCC is shown in Listing 9.

Listing 8: nowait Example

Fortran
program test5
   integer :: a(5), b(5), c(5)
   integer :: i
   integer :: SIZE
 
   SIZE = 5
 
   do i=1,SIZE
      a(i) = i-1
   enddo
 
   call test(a, b, c, SIZE)
 
   do i=1, SIZE
      write(*,*) a(i),'  ',b(i),'  ',c(i)
   enddo
 
end program test5
 
subroutine test(a, b, c, SIZE)
   integer :: SIZE
   integer :: a(SIZE), b(SIZE), c(SIZE)
   
   integer :: i
 
!$omp parallel
 
!$omp do nowait
   do i=1,SIZE
      b(i) = a(i)*a(i)
   enddo
 
!$omp do nowait
   do i=1,SIZE
      c(i) = a(i)/2
   enddo
 
!$omp end parallel
   return
end subroutine test
 
C
// omp_nowait.cpp
// compile with: /openmp /c
#include <stdio.h>
 
#define SIZE 5
 
void test(int *a, int *b, int *c, int size)
{
    int i;
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (i = 0; i < size; i++)
            b[i] = a[i] * a[i];
 
        #pragma omp for nowait
        for (i = 0; i < size; i++)
            c[i] = a[i]/2;
    }
}
 
int main( )
{
    int a[SIZE], b[SIZE], c[SIZE];
    int i;
 
    for (i = 0; i < SIZE; i++)
        a[i] = i;
 
    test(a, b, c, SIZE);
 
    for (i = 0; i < SIZE; i++)
        printf("%d, %d, %d\n", a[i], b[i], c[i]);
}
Listing 9: nowait Output

           0            0              0
           1            1              0
           2            4              1
           3            9              1
           4           16             2

All computations in the code use integers. The first column is the number, the second column is the square of the number, and the third column is half of the number (integer division). The results are correct, but be very sure to check your code: the nowait clauses are safe here only because the two loops have no data dependencies between them, so there is no chance of a race condition in the parallel region.

Be absolutely sure nowait does not cause a race condition. Run your code without it and then try running your code with it. If you get different answers, nowait has exposed a race condition. In that case, remove the nowait clause and either track down and fix the dependence or simply leave nowait out.

reduction

A common operation in high-performance computing is a reduction, which generally is a loop that takes an array and reduces it to a single number. A simple example is searching for the maximum value in an array (i.e., max()). The OpenMP standard includes a reduction() clause for use in do/for loops to parallelize the reduction.

The general form of the clause is:

reduction(operator:list)

The operator can be arithmetic (e.g., +, -, *, max, min) or one of the logical operators in C and Fortran. A simple example (Listing 10) computes the average of an array. In the reduction() clause, the variable avg is automatically made private to each thread and initialized to zero (although in the example, I explicitly initialize it to 0.0). The sum of the array is accumulated in avg and then divided by the number of elements in the array to compute the average outside the parallel loop.

Listing 10: reduction() Operation

Fortran
program test6
   use omp_lib
 
   integer :: N = 1000
   real, allocatable :: x(:)
   real :: avg_base, avg
   integer :: i
 
   allocate(x(N))
 
   do i=1,N
      x(i) = real(i)*sin(real(i))
   enddo
 
   avg_base = 0.0
   avg = 0.0
 
   do i =1,N
      avg_base = avg_base + x(i)
   enddo
   write(*,*) 'avg_base = ',(avg_base/N)
 
!$omp parallel private(id)
!$omp do reduction(+:avg)
 
   do i=1,N
      avg = avg + x(i)
   enddo
 
!$omp end do
!$omp end parallel
 
   write(*,*) 'avg = ',(avg/N)
end program test6
 
C
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
 
#define N 1000
 
 
int main( )
{
   float x[N];
   int i;
   float avg_base, avg;
 
   for (i = 0; i < N; i++)
   {
      x[i] = i * sin(i);
   }
 
   avg_base = 0.0;
   avg = 0.0;
 
   for (i = 0; i < N; i++)
   {
      avg_base = avg_base + x[i];
   }
   avg_base = avg_base / N;
 
   printf("avg_base = %f\n",avg_base);
 
#pragma omp parallel private(i)
   {
#pragma omp for reduction(+:avg)
      for (i = 0; i < N; i++)
      {
         avg = avg + x[i];
      }
   }
   
   avg = avg / N;
   printf("avg is %f\n", avg);
 
}

This simple example illustrates how to define a reduction in a parallel region. Look for loops that produce a single number or just a small number of results. These regions could be targets for the use of a reduction() clause.
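For example, the maximum search mentioned at the start of this section can be handled with the max operator. A minimal sketch, assuming a float array x of length n (the max and min reduction operators for C require OpenMP 3.1 or later):

float maxval = x[0];
 
#pragma omp parallel for reduction(max:maxval)
   for (int i = 1; i < n; i++)
      if (x[i] > maxval)
         maxval = x[i];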

Loop Schedule

By default, OpenMP will try to divide the loop iterations equally among the threads. You can control how the work is divided with a schedule clause on the do/for directive of the form

!$omp do schedule(kind[, chunksize])

The integer expression chunksize is a positive value, and the values for kind are:

  • static
  • dynamic
  • guided
  • auto
  • runtime

The static schedule breaks up the iteration space into chunks of size chunksize, and the chunks are then assigned cyclically to the threads in order (chunk 0 goes to thread 0, chunk 1 goes to thread 1, etc.). If chunksize is not specified, the iteration space is divided into chunks that are as equal in size as possible, and each thread in the team is assigned one chunk.

The dynamic schedule divides the iteration space into chunks of size chunksize and assigns them to threads on a first come, first served basis; that is, when a thread is finished with a chunk, it is assigned the next chunk in the list. When no chunksize is specified, the default is 1. Dynamic scheduling allows you to create more chunks than there are threads and still have them all execute.

The guided schedule is somewhat similar to dynamic, but the chunks start off large and get exponentially smaller (again, with a chunksize of 1 as the default). The size of the next chunk is proportional to the number of remaining iterations divided by the number of threads. If you specify chunksize with this schedule, it becomes the minimum size of the chunks.

The auto schedule lets the run time decide the assignment of iterations to threads on its own. For example, if the parallel loop is executed many times, the run time can evolve a schedule with some load balancing characteristics and low overheads.

The fifth schedule, runtime, defers the scheduling decision until run time. The schedule is then defined by an environment variable (OMP_SCHEDULE), which allows you to vary the schedule simply by changing OMP_SCHEDULE. You cannot specify a chunksize for this clause.
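A short sketch of schedule(runtime) in use; the loop body, the function work(), and the count n are placeholders:

#pragma omp parallel for schedule(runtime)
   for (int i = 0; i < n; i++)
      work(i);

You could then launch the program with, for example, OMP_SCHEDULE="guided,8" ./my_program and compare schedules without recompiling.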

Fortran and C scheduling clause examples that use the dynamic scheduler with a chunk size of 4 are shown in Listing 11. Which schedule is best really depends on your code, the compiler, the data, and the system. To make things easier when you port to OpenMP, I would recommend leaving the schedule at the default (which is implementation dependent) or changing it to runtime and then experimenting with the OMP_SCHEDULE environment variable. The auto schedule can also be a reasonable choice, because it leaves the decision to the OpenMP implementation.

Listing 11: schedule() Clause

Fortran
!$omp do schedule(dynamic, 4)
 
C
#pragma omp for schedule(dynamic, 4)

Summary

In this article, I expanded on the previous article about OpenMP directives with some additional directives and a number of clauses for those directives. Remember, the goal of these articles is to present simple directives that you can use to get started porting your code to OpenMP. In this article, I covered the following topics:

  • Data and control parallelism
  • Teams and loops revisited
  • omp parallel clauses, including default, firstprivate, lastprivate, nowait, and reduction, as well as loop scheduling

In the next article, I will present some best practices in general use by myself and others. I’ll also take a stab at discussing how to use the OpenMP accelerator directive to run code on GPUs.