OpenMP

Parallel Region

The first step in porting code to OpenMP is to use the fundamental OpenMP parallel construct to define the parallel region (Table 2). This directive creates a team of threads for the parallel region. After the directive, each thread executes the code, including any subroutines or functions. The end directive synchronizes all threads.

Table 2: Defining the Parallel Region

Fortran:

!$omp parallel
   ...
!$omp end parallel

C:

#pragma omp parallel
{
   ...
}

The parallel region, delimited by omp parallel and end parallel, has a couple of restrictions: It must be contained within a single routine, and the code inside the parallel region must be a structured block. You cannot jump into or out of a parallel region (i.e., no GOTO statements).

Each thread in the team is assigned a unique ID (thread ID), generally starting with zero. These thread IDs can be used for any purpose the user desires. For example, if a parallel region seems to be running slowly, you can use the thread IDs to identify whether one thread is running slower than the others.
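As a minimal illustration (not part of the listings in this article), the following C sketch prints each thread's ID with the runtime routines omp_get_thread_num() and omp_get_num_threads(); with GCC, for example, you would compile it with -fopenmp:

#include <stdio.h>
#include <omp.h>

int main(void)
{
   /* every thread in the team executes this block */
   #pragma omp parallel
   {
      int tid = omp_get_thread_num();       /* unique thread ID, starting at 0 */
      int nthreads = omp_get_num_threads(); /* size of the team */
      printf("Thread %d of %d\n", tid, nthreads);
   }
   return 0;
}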

Some clauses that I will discuss in a later article can be added to the parallel directive to allow more flexibility or to allow easier coding for certain patterns. The focus of this and subsequent articles is to use just a few directives, so you can start porting serial code to OpenMP.

Notice in Table 3 that you can nest parallel regions. The first parallel region will use N threads (cores), and the second parallel region (the nested parallel region) will use N threads for each thread in the first region, so you end up with N + N^2 threads. If you don't control the number of threads per parallel region carefully, you can generate more than one thread per core, perhaps inhibiting performance. In some cases, this is desired behavior, because some processors have cores that can accommodate more than one thread at a time. Just be careful not to overload cores with multiple threads.

Table 3: Nesting Parallel Regions

Fortran:

!$omp parallel
   ...
   !$omp parallel
      ...
   !$omp end parallel
   ...
!$omp end parallel

C:

#pragma omp parallel
{
   ...
   #pragma omp parallel
   {
      ...
   }
   ...
}
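If you want to experiment with nesting without overloading your cores, a small sketch along the following lines can help. It assumes nested parallelism is enabled with omp_set_nested() (an older but widely supported runtime call) and limits each region to two threads with omp_set_num_threads(); the thread counts you see will depend on your runtime settings:

#include <stdio.h>
#include <omp.h>

int main(void)
{
   omp_set_nested(1);        /* allow nested parallel regions */
   omp_set_num_threads(2);   /* two threads per region keeps the totals small */

   #pragma omp parallel
   {
      int outer = omp_get_thread_num();
      #pragma omp parallel
      {
         printf("outer thread %d, inner thread %d\n",
                outer, omp_get_thread_num());
      }
   }
   return 0;
}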

Parallelizing Loops with Work-Sharing Constructs

The parallel region spawns a team of threads that each do work, which is how the performance of an application is improved (the work is spread across multiple cores). The work-sharing constructs must be placed inside a parallel region; if they aren't, only one thread does the work. In general, a work-sharing construct is not capable of spawning new threads. Only the omp parallel directive can do that (more on this in a future article).

Each thread in the team has its own data, although cooperating threads can share data (termed “shared” data). Some directives can perform reductions within the team and copy data to the threads.

In general, because OpenMP is based on the SMP model, most variables are shared between the threads by default. For example, in Fortran this includes COMMON blocks and MODULE variables; in C, file-scope variables and static variables are shared. On the other hand, loop index variables are private to each thread.

The directive pairs in Table 4 allow you to break a loop across the threads in the team. OpenMP splits the loop iterations as evenly as possible across all of the threads without overlap. When the threads are created, the appropriate portions of a, b, and c are copied to each thread (i.e., the fork portion of the OpenMP model).

Table 4: Breaking a Loop Across Threads

Fortran:

!$omp parallel do
   do i=1,N
      a(i) = b(i) + c(i)
   end do
!$omp end parallel do

C:

#pragma omp parallel for
for (i=0; i < n; i++) {
   a[i] = b[i] + c[i];
}
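For context, a self-contained C version of the loop in Table 4 might look like the sketch below; the array size N and the initialization values are arbitrary and only for illustration:

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
   double *a = malloc(N * sizeof(double));
   double *b = malloc(N * sizeof(double));
   double *c = malloc(N * sizeof(double));
   int i;

   for (i = 0; i < N; i++) {   /* initialize the inputs serially */
      b[i] = (double)i;
      c[i] = 2.0 * (double)i;
   }

   /* the loop iterations are divided among the threads in the team */
   #pragma omp parallel for
   for (i = 0; i < N; i++) {
      a[i] = b[i] + c[i];
   }

   printf("a[N-1] = %f\n", a[N-1]);
   free(a); free(b); free(c);
   return 0;
}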

Once the code inside the region finishes, the data is copied back from the worker threads to the master thread (i.e., the synchronization part of the OpenMP model); then, the threads are destroyed and computing continues with the next instructions.

To gain the most performance from your code, you want to put as much work as possible in the parallel region. If the amount of work is too small, the parallel version could take as long as, or longer than, the serial code, because creating the threads, copying the data over, and synchronizing the data after the parallel loop can take more time than simply running a serial section of code.
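If you want to see this overhead for yourself, one way is to time a deliberately tiny loop with the runtime routine omp_get_wtime() and compare the serial and parallel versions. The sketch below is only illustrative; the numbers will vary from system to system, and the loop bound is intentionally far too small to benefit from threading:

#include <stdio.h>
#include <omp.h>

#define SMALL 100   /* deliberately tiny amount of work */

int main(void)
{
   double a[SMALL], b[SMALL], c[SMALL];
   double t;
   int i;

   for (i = 0; i < SMALL; i++) {
      b[i] = (double)i;
      c[i] = 2.0 * (double)i;
   }

   t = omp_get_wtime();                 /* time the serial loop */
   for (i = 0; i < SMALL; i++)
      a[i] = b[i] + c[i];
   printf("serial:   %e seconds\n", omp_get_wtime() - t);

   t = omp_get_wtime();                 /* time the parallel loop */
   #pragma omp parallel for
   for (i = 0; i < SMALL; i++)
      a[i] = b[i] + c[i];
   printf("parallel: %e seconds\n", omp_get_wtime() - t);

   printf("a[0] = %f\n", a[0]);         /* keep the compiler from dropping the loops */
   return 0;
}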

Nested Loops

Assume you have nested loops in your code as shown in Table 5, and try to determine where you would put your parallel region and loop directive for these nested loops. To give each thread the most work (j*k = 100 inner iterations for each value of the outer index), you would probably put the parallel region around the outside loop (Table 6); otherwise, each thread does less work (fewer loops), and you aren't taking full advantage of the parallelism. To convince yourself, put the parallel do directive around the innermost loop (a sketch of this follows Table 6): the parallel region is then created and destroyed 100 times, each time running the innermost loop only 10 times.

Table 5: Serial Code with Loops

Fortran:

   do i = 1, 10
      do j = 1, 10
         do k = 1, 10
            A(i,j,k) = i * j * k
         end do
      end do
   end do

C:

   for (i=0; i < 10; i++) {
      for (j=0; j < 10; j++) {
         for (k=0; k < 10; k++) {
            A[i][j][k] = i * j * k;
         }
      }
   }

Table 6: Parallel Code with Loops

Fortran:

!$omp parallel do
   do i = 1, 10
      do j = 1, 10
         do k = 1, 10
            A(i,j,k) = i * j * k
         end do
      end do
   end do
!$omp end parallel do

C:

#pragma omp parallel for
   for (i=0; i < 10; i++) {
      for (j=0; j < 10; j++) {
         for (k=0; k < 10; k++) {
            A[i][j][k] = i * j * k;
         }
      }
   }
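As suggested above, a sketch of the inner-loop placement in C would look like the following: the parallel region is entered i*j = 100 times, and each time only 10 iterations are shared among the threads, so the fork/join overhead dominates:

#include <stdio.h>

int main(void)
{
   double A[10][10][10];
   int i, j, k;

   for (i = 0; i < 10; i++) {
      for (j = 0; j < 10; j++) {
         /* the parallel region is created and destroyed here, 100 times,
            each time distributing only 10 iterations among the threads */
         #pragma omp parallel for
         for (k = 0; k < 10; k++) {
            A[i][j][k] = i * j * k;
         }
      }
   }

   printf("A[9][9][9] = %f\n", A[9][9][9]);
   return 0;
}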