OpenACC Directives for Data Movement

Introduction to Data Clauses

OpenACC has a data directive that tells the compiler to create code that performs specific data movements and provides hints about data usage. Table 1 shows the data directive as it is used in C and Fortran.

 

Table 1The data Directive

Fortran C
!$acc data <clause>
#pragma acc data <clause>

The directive is acc data. The five clauses I look at in this article that can combine with the data directive are:

  • copy
  • copyin
  • copyout
  • create
  • present

Their names describe their obvious functions. The specific data clause combined with the acc data directive constitutes the start of a data region (Table 2). In C, the beginning and end of the data region is marked with {curly braces}. In Fortran, the data region begins with the data directive and has another directive to specify the end of the data region. In subsequent sections, I briefly discuss each of the clauses.

Table 2: Data Regions

Fortran C
!$acc data (clause)

..

!$acc end data

#pragma acc data (clause)
{

...

}

copy

The first data clause, copy, copies data to and from the host and accelerator. When entering the data region, the application allocates accelerator memory and then copies data from the host to the GPU. When exiting the data region, the data from the accelerator is copied back to the host. Table 3 shows a simple example of using the copy clause.

Table 3: The copy Clause

Fortran C
!$acc data copy(a)
!$acc parallel loop
  do i=1,n
    a(i) = 0.0
  enddo
!$acc data end
 



#pragma acc data copy(a)
{
  #pragma acc parallel loop
  {
    for (int i=0; i < n; i++)
    {
      a[i] = 0.0
    }
  }
}

The entire array is copied from the host to the accelerator with the acc data copy directive. The loop then is run on the accelerator by the acc parallel loop directive. After the loop is finished, the array is copied from the accelerator back to the host courtesy of the acc end data directive for Fortran or the closing curly brace for C. OpenACC allows you to combine directives into a single line, so you could write the previous code as shown in Table 4.

Table 4: Combining Directives

Fortran C
!$acc parallel loop copy(a)
  do i=1,n
    a(i) = 0.0
  enddo
 
 
 
#pragma acc parallel loop copy(a)
{
  for (int i=0; i < n; i++)
  {
    a[i] = 0.0
  }
}

In Fortran, you no longer have to use an end data directive. The copy data clause ends where the parallel loop directive ends, which is implicit after the loop. In the case of C, combining directives on one line means you don’t need to use additional curly braces to define the data region, saving some typing and coding.

copyin

The next data clause, copyin, allocates memory on the accelerator and copies data from the host to the accelerator when entering the data region indicated by the directive (Table 5); however, it does not copy the data back to the host on exiting the data region. This directive is typically used when you want to copy data to the device, but you don't need the data to be copied back out because it hasn’t changed. Classically, this is used for “input” data to the OpenACC region.

It is important to note that the data moved by the copyin clause is left in the GPU memory. No definition in the OpenACC specification says what happens to that data when the data region is exited. You could explicitly delete this memory in a data directive with the delete clause (not covered in this article), or you could just leave it.

Table 5: The copyin Clause

Fortran C
!$acc parallel loop copy(a) copyin(b)
  do i=1,n
    a(i) = b(i)
  enddo
 
 
 
#pragma acc parallel loop copy(a) copyin(b)
{
  for (int i=0; i < n; i++)
  {
    a[i] = b[i]
  }
}

Because array is not needed after it is moved into the accelerated region, the parallel loop copyin directive just copies into the parallel region, but not out. For this example, I use the copyin directive for array and the copy directive for array a.

Although I could have used the copy clause for b, because I don't need it afterward, I can save some data movement and make the overall code a little faster and a little more scalable. (Lots of small performance gains equal big performance gains.)

copyout

The third data clause, copyout, allocates memory on the accelerator when entering the accelerated region but only copies data from the accelerator to the host when exiting the OpenACC data region (Table 6). No data is copied from the host to the accelerator. This directive is classically used only to return data from a directive region to the host; that is, it is just “output” from an accelerated region.

Table 6: The copyout Clause

Fortran C
!$acc parallel loop copyin(a, b) 
    copyout(c)
   do i=1,n
      c(i) = a(i) * b(i)
   enddo
 
 
 
#pragma acc parallel loop copyin(a, b) 
    copyout(c)
{
   for (int i=0; i < n; i++)
   {
      c[i] = a[i] * b[i]
   }
}

A subtlety to note for this example is that you can put multiple variables in a single data movement clause (the five discussed in this article). This example uses copyin(a, b) to copy both arrays into the accelerated region. You don't have to put all variables into a single directive, but it sometimes helps to save space and typing, thus reducing errors. Just be careful that you either end each data region with the acc end data directive in Fortran or a closed curly brace in C.

create

The fourth data clause, create, allocates memory on the accelerator when the accelerated region is entered and deallocates the memory when the accelerated region is exited (Table 7). No data is copied to or from the host and the accelerator. Because the data is local to the accelerator, you can think of it as temporary.

Table 7: The create Clause

Fortran C
!$acc parallel loop copyin(a, b, e) 
    create(c) copyout(d)
  do i=1,n
    c(i) = a(i) * b(i)
  enddo
  do j=(n/2),n
    d(j) = c(j) - e(j)
  end do
 



#pragma acc parallel loop copyin(a, b, e) 
    create(c) copyout(d)
{
  for (int i=0; i < n; i++)
  {
    c[i] = a[i] * b[i]
  }
  for (int j=(n/2)+1; j < n; j++)
  {
    d[j] = c[j] - e[j]
  }
}

For this example, the arrays ab, and are copied into the accelerated region but are not used after that, so the copyin clause is used. Array is copied from the accelerated region to the host after computations are finished, but it is not copied into the accelerated region. Therefore a copyout clause is used.

Array is only used in the accelerated region. It is allocated when entering the accelerated region and deallocated when leaving the accelerator region. It is not copied to or from either the host or the accelerator; therefore, the create clause is used.

This example is very, very simple. Although you could write it with one loop so you wouldn’t have to create a local array on the accelerator, this example just illustrates how you could do so.

present

To help the compiler produce better code, the present clause in a data directive makes the compiler check whether the data is on device (Table 8). If it isn't, the execution will abort.

Table 8: The present Clause

Fortran C
!$acc data copyin(a, b) copy(c)
!$acc parallel loop
  do i=1,n
    c(i) = a(i) * b(i)
  enddo
 


...
 
!$acc parallel loop copyin(e) 
    present(a) copyout(f)
  do j=1,n
    f(i) = 2.0*e(j) + 
           (1.0/4.0)*(a(j)*4.14)
  end
 
!$acc end data
#pragma acc data copyin(a, b) copy(c) {
  #pragma acc parallel loop {
    for (int i=0; i < n; i++)
    {
      c[i] = a[i] * b[i]
    }
  }
 
...
 
#pragma acc parallel loop copyin(e) 
    present(a) copyout(f) {
    for (int j=0; j < n; j++)
    {
      f[i] = 2.0*e[j] + (1.0/4.00*(a[j]*4.14)
    }
  }
}

In this simple code, the data region is defined around the two parallel loop directives. The data directive copies the and data into the accelerator memory but doesn't expect it to be copied back to the host when exiting the data region. It also copies into the accelerator memory, where it is probably operated on in some way; then, the data is copied back to the host when exiting the data region.

After the parallel region, perhaps some code is executed on the host (e.g., some I/O), during which it is assumed that arrays and are not changed. After the sequential region, a second parallel region is encountered, in which array is copied into the device memory (the copyin clause), and is created on the device and copied out when exiting the region (the create clause). However, you don’t need to copy into the device memory because it’s already there, courtesy of the present clause: The present(a) clause tells the compiler to look for in the device memory.