OpenACC directives can improve performance if you know how to find where parallel code will make the greatest difference.

Porting Code to OpenACC

In previous articles, I talked about how OpenACC can help you parallelize your code and gave a few simple examples of how to use OpenACC directives, but I didn’t discuss how to go about porting your code. In this article, I present one approach to porting code.

Checker

When porting code to OpenACC, or really any new language, you need to be able to check that the output is correct. One way is to have the code itself check the output, which can involve comparing the results against a known good solution that is either stored within the application or easily computed by it. Another way is to use an external output checker that takes the output from the application and checks it against known good output. This can be as simple as reading two data files and comparing the results.

Output from a CPU run can differ by a small amount from that of a CPU+GPU run because of reductions. For example, summing floating point numbers in a different order through parallelism can yield slightly different results. FMA (fused multiply-add) is another source of difference: It performs a multiply and an add in one step with a single rounding, whereas a separate multiply followed by an add rounds twice.
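The effect of summation order is easy to reproduce on a CPU alone. The following sketch (in Python, chosen only for brevity; the function names are illustrative and not from any library) sums the same values serially and then as interleaved partial sums, roughly the way a parallel reduction combines per-thread results. The input suggested below is deliberately extreme so that the difference is visible; with typical data, the two results usually differ only in the last few digits.

```python
def sum_serial(values):
    """Add the values one after another, left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_as_partials(values, n_partials):
    """Accumulate n_partials interleaved partial sums, then combine them.

    This only mimics the ordering of a parallel reduction; nothing here
    actually runs in parallel.
    """
    partials = [0.0] * n_partials
    for i, v in enumerate(values):
        partials[i % n_partials] += v
    total = 0.0
    for p in partials:
        total += p
    return total
```

With the input [1e16, 1.0, -1e16, 1.0], sum_serial returns 1.0, whereas sum_as_partials with two partial sums returns 2.0, even though both add exactly the same four numbers.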

As a result, you should not perform a bit-for-bit comparison of the output from the ported code against the “correct” answer (i.e., don't use something like md5sum), because you will find differences that are not necessarily indicative of incorrect code.

Instead, you should look at differences relative to the data values. For example, if the difference between two numbers is 100.0, but you are working with values of 10^8, then the difference (0.0001%) might not be important. It’s really up to the user to decide whether the difference is significant or not, but it's likely that a difference of this kind is not important, and the answer can be considered “correct.”
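A minimal sketch of such an external checker follows. It assumes that both output files are plain text holding the same number of whitespace-separated values; the function names, the file layout, and the default tolerance of 1e-5 are all illustrative assumptions, not anything prescribed by OpenACC or a particular tool.

```python
def relative_difference(new, good):
    """Return |new - good| / |good|, or the absolute difference if good is 0."""
    diff = abs(new - good)
    return diff if good == 0.0 else diff / abs(good)


def read_values(path):
    """Read whitespace-separated floating point values from a text file."""
    with open(path) as f:
        return [float(token) for token in f.read().split()]


def check_files(new_path, good_path, rel_tol=1e-5):
    """Compare two output files value by value.

    Returns (passed, worst), where worst is the largest relative
    difference found and passed is True if worst <= rel_tol.
    The default rel_tol is an arbitrary placeholder; pick one that
    suits your data.
    """
    new_vals = read_values(new_path)
    good_vals = read_values(good_path)
    if len(new_vals) != len(good_vals):
        raise ValueError("files hold different numbers of values")
    worst = max(
        (relative_difference(n, g) for n, g in zip(new_vals, good_vals)),
        default=0.0,
    )
    return worst <= rel_tol, worst
```

For the example above, relative_difference(1e8 + 100.0, 1e8) evaluates to 1e-6, which passes the placeholder tolerance of 1e-5 but would fail a stricter one such as 1e-7.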

Comparing matrices is another challenge. A simple online search turns up different ways to compare two matrices. As a brute force method, you could subtract each (i,j) entry of the known “good” output from the corresponding entry of the “new” output and take the largest absolute difference over the whole matrix. (It is a scalar value.) Next, search through the known good matrix for the largest absolute value and compute the largest percent difference: (largest difference)/(largest value). Although this approach has no mathematical underpinnings, it allows you to look for the largest possible difference. How you compare two matrices is up to you, but be sure that you understand what you are computing and what the output means.
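The brute force comparison can be sketched in a few lines. Here the matrices are assumed to be lists of equal-length rows of floats, and the function name is made up for illustration. The function returns a fraction; multiply by 100 for a percentage.

```python
def max_relative_matrix_difference(new, good):
    """Return (largest |new - good| entry) / (largest |good| entry).

    If the known good matrix is all zeros, the largest absolute
    difference is returned instead, to avoid dividing by zero.
    """
    if len(new) != len(good) or any(
        len(row_n) != len(row_g) for row_n, row_g in zip(new, good)
    ):
        raise ValueError("matrices have different shapes")
    largest_diff = max(
        (abs(n - g)
         for row_n, row_g in zip(new, good)
         for n, g in zip(row_n, row_g)),
        default=0.0,
    )
    largest_value = max((abs(g) for row in good for g in row), default=0.0)
    if largest_value == 0.0:
        return largest_diff
    return largest_diff / largest_value
```

For instance, if the good matrix is [[1.0, 2.0], [3.0, 4.0]] and the new one is [[1.0, 2.0], [3.0, 4.5]], the largest difference is 0.5, the largest value is 4.0, and the function returns 0.125. Note that this measure can hide a large relative error in a small entry, which is one reason to be sure you understand what it reports.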

Profiling

As you port your application to use OpenACC, understanding the effect of directives on performance is key. The first and arguably the easiest measure of performance is wall clock time for the entire application (start to finish). The second is the profile of the application, or the amount of time spent in each routine in the code.

Note that “profile” here does not mean a time history but is simply a list of the sum total of time spent in each routine of the application. If you order these times from the routine with the most time to the routine with the least time, you quickly see which routines take the most time, which suggests an “attack” plan for porting. By focusing on the routines that take the most time, you can use OpenACC directives to reduce the run time of those routines and thus reduce the overall application run time.

One way to profile your application is to instrument it to measure the wall clock time of each routine. If a routine is called repeatedly, the instrumentation needs to sum the time over all of its calls. A second way is to use the profiling tools that come with the compiler. The exact details depend on the specific compiler you are using. If you are using GCC, you can find some resources online. If you are using the PGI compiler, you can use pgprof at the command line. An online tutorial explains how to use pgprof with OpenACC applications. For this article, I use the Community Edition of the PGI compilers.
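The first approach, manual instrumentation, follows the same pattern in any language: Record a wall clock time on entry to a routine, record it again on exit, and add the difference to a running total for that routine. The sketch below shows the pattern in Python for brevity; the names routine_times, timed, and stack_rank are hypothetical, and in a C or Fortran application you would do the same bookkeeping with whatever wall clock timer your platform provides.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical accumulator: total wall clock seconds per routine name.
routine_times = defaultdict(float)


@contextmanager
def timed(name):
    """Add the wall clock time of the enclosed block to routine_times[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        routine_times[name] += time.perf_counter() - start


def stack_rank(times):
    """Return (name, seconds) pairs ordered from most to least time."""
    return sorted(times.items(), key=lambda item: item[1], reverse=True)
```

Wrapping each call in a block such as "with timed("solver"):" sums the time across repeated calls, and stack_rank(routine_times) then lists the routines from most to least expensive.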

Assume you have an application you want to port to OpenACC. The first thing to do is profile it to create a stack rank of the routines with the most run time. Compile your code normally (no extra switches), then run it under pgprof with the command:

$ pgprof [options] ./exe [application arguments]

In this case, the executable is named exe for simplicity. For a first profile, I use the options:

$ pgprof --cpu-profiling-thread-mode separated --cpu-profiling-mode top-down -o my.prof

The first option separates the output by thread, which is really useful if the code uses MPI, because MPI often uses extra threads. The second option lists the routines that use the most time first and the routines that use the least amount of time last, which is referred to as a “stack rank” (first to last). The third option, -o my.prof, writes the profile data to an output file, so you can rerun the analysis anytime you want with the simple command:

$ pgprof -i my.prof

This simple application of pgprof can get you started profiling your application. You should profile your application every so often to understand how the routines stack rank relative to one another. Although the ranking is likely to change as you port, it tells you which routines to target for OpenACC.