Parallelizing Code – Loops

OpenACC Programming Approach

In this article I discussed only one directive, loop. You can improve performance a great deal if you look for loop parallelism in your code and start using this directive. There is a recommended approach to using OpenACC directives, including the loop directive.

Although I do not want to change your coding style, ideally, adding directives to code should be driven by profiling and tracing. Profiling determines the routines where most of the run time is spent, expressed as a simple table: the routine name and how much time was spent in that routine. Then you stack-rank the times, and the top routines are the initial primary focus. Tracing is a timeline examination of what is happening in the application. In the case of accelerators, this includes data movement to and from accelerators.
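To make the "simple table" idea concrete, here is a minimal sketch in C of timing routines by hand and stack-ranking the results. A real profiler does this for you; the names profile_row, time_call, and stack_rank are hypothetical helpers invented for this illustration, and clock() measures CPU time only, so it says nothing about data movement to an accelerator.

```c
#include <stdlib.h>
#include <time.h>

/* One row of a flat profile: routine name and CPU seconds spent in it. */
typedef struct {
    const char *name;
    double seconds;
} profile_row;

/* Time a single call of fn with the standard clock() CPU timer. */
double time_call(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* qsort comparator: the largest time sorts first. */
static int by_seconds_desc(const void *a, const void *b)
{
    double sa = ((const profile_row *)a)->seconds;
    double sb = ((const profile_row *)b)->seconds;
    return (sa < sb) - (sa > sb);
}

/* Stack-rank the table so the most expensive routine is rows[0]. */
void stack_rank(profile_row *rows, size_t n)
{
    qsort(rows, n, sizeof rows[0], by_seconds_desc);
}
```

After the sort, the first few rows are the routines worth targeting with directives first.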

With the initial list of target routines in hand, you can start adding directives to your code. Generally, adding only one directive at a time is recommended. By incrementally adding directives, you can understand the effect each makes on run time. You add a directive, rebuild the code, run the code, test it to make sure the answers are correct, and then look at the effect on performance.
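As a sketch of one such incremental step, the C code below keeps a serial reference routine next to a copy that carries a single added directive, so the two results can be compared after each rebuild. The routine names are hypothetical. The copyin and copy data clauses are my addition: with bare pointers the compiler cannot deduce array extents, so they are included to keep the example valid, even though data movement is a topic of its own. A compiler without OpenACC enabled ignores the pragma and builds the loop serially.

```c
/* Serial reference: y = a*x + y. Keep this version for checking answers. */
void saxpy_serial(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Same loop with exactly one OpenACC directive added.
 * Assumption: the data clauses state the array extents because the
 * compiler cannot infer them from pointer arguments. */
void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Building with OpenACC enabled (for example, -fopenacc with GCC or -acc with the NVIDIA HPC compilers) and then running both routines on the same input gives you the correctness check and the timing comparison for this one directive.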

While adding loop directives to your code, if run time goes up, don't despair. This can happen because of costly data movement between the CPU and the accelerator. (I will cover this topic in an upcoming article.) At this stage, you need to focus on the accuracy of the output. If the output is not correct, then you might need to change the directive, change the code, or even drop the loop directive.


OpenACC directives allow you to take serial code and “port” it to multicore CPUs or accelerators such as GPUs. A compiler that does not understand OpenACC treats the directives as comments and ignores them, so you can maintain one version of the code, which reduces the chance of errors and keeps code size down, and still build it with your usual compiler. If a compiler understands OpenACC, then simply adding specific flags to your compile line will allow you to build and run with multiple CPU cores or accelerators.

Performance improvements are achieved by locating regions that can be parallelized in your application. A classic approach is to find loops that can be parallelized. This article tackled the parallel and loop OpenACC directives and clauses for both Fortran and C. A best practice is to combine parallel and loop in one directive (e.g., !$acc parallel loop in Fortran) for a loop nest you want to parallelize. If the compiler fails to parallelize a loop within that nest, de-nest the troublesome loop and put a !$acc loop directive before it, which allows you to optimize the directives for that specific loop.
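In C, the same pattern might look like the following minimal sketch: a combined parallel loop directive on the outer loop of a nest and a separate loop directive on the inner loop, which is where you would attach loop-specific clauses. The routine scale_rows and its arguments are hypothetical, and the copyin and copyout clauses are my addition so that the compiler knows the extents of the pointer arguments.

```c
/* Scale each row i of an n-by-m row-major matrix a by s[i], writing to b.
 * Every (i, j) iteration is independent, so both loops can be parallelized. */
void scale_rows(int n, int m, const double *restrict a,
                const double *restrict s, double *restrict b)
{
    /* Combined directive on the loop nest (C spelling of !$acc parallel loop). */
    #pragma acc parallel loop copyin(a[0:n*m], s[0:n]) copyout(b[0:n*m])
    for (int i = 0; i < n; i++) {
        /* Separate loop directive for the inner loop, so clauses can be
         * tuned for this loop alone. */
        #pragma acc loop
        for (int j = 0; j < m; j++) {
            b[i * m + j] = s[i] * a[i * m + j];
        }
    }
}
```

The restrict qualifiers tell the compiler that the arrays do not overlap, which removes a common reason for a C compiler to decline to parallelize a loop.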

When using loop parallelization, it is best practice to focus on the routines that use the most run time. In this fashion, you’ll make the biggest dent in run time with the fewest directives. A simple profile of the application can provide you with the stack rank of the most time-consuming routines.

As you add directives to your application, be sure to check the output. It should match the application output when the application isn't built with directives. This is absolutely vital to improving your application. Running fast but not getting the correct output is worthless. Running on processors other than CPUs can produce slightly different output. Therefore, you don't want to do a bit-for-bit comparison; rather, you want to compare the output for significant differences. Defining what is “significant” is up to the user and the application, but it should not be taken lightly.
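One way to compare for significant differences rather than bit for bit is a tolerance check. The C sketch below counts elements whose difference exceeds a combined absolute and relative tolerance. The function name and both tolerance parameters are hypothetical, and choosing their values is exactly the application-specific decision described above.

```c
#include <stddef.h>

/* Magnitude helper, kept dependency-free; fabs() from <math.h> works as well. */
static double magnitude(double v)
{
    return v < 0.0 ? -v : v;
}

/* Count elements of test that differ significantly from ref.
 * An element passes if |ref - test| <= abs_tol + rel_tol * max(|ref|, |test|).
 * NaN values never pass, because the comparison is written as a negation. */
size_t count_mismatches(const double *ref, const double *test, size_t n,
                        double rel_tol, double abs_tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        double diff = magnitude(ref[i] - test[i]);
        double mr = magnitude(ref[i]);
        double mt = magnitude(test[i]);
        double scale = mr > mt ? mr : mt;
        if (!(diff <= abs_tol + rel_tol * scale)) {
            bad++;
        }
    }
    return bad;
}
```

A return value of zero means the accelerated output agrees with the reference output to within the chosen tolerances.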

OpenACC has a number of directives other than parallel and loop to help you port applications. However, these two directives allow you to start attacking “hot spots” in your code immediately to improve performance. Learning just two OpenACC directives in exchange for improving parallelization through loops isn't a bad trade.

In the next OpenACC article, I will discuss data usage, focusing on how you can consider it in combination with parallel loops to get even better performance.