Using loop directives to improve performance

Parallelizing Code

OpenACC Programming Approach

In this article, I discussed only one directive: parallel loop. You can affect performance a great deal if you look for loop parallelism in your code and start using this directive. There is a recommended approach to using OpenACC directives, including the parallel loop directive.

Although I do not want to change your coding style, ideally, adding directives to code should be driven by profiling and tracing. Profiling determines the routines where most of the run time is spent, expressed as a simple table: the routine name and how much time was spent in that routine. Then you stack-rank the times, and the top routines are the initial primary focus. Tracing is a timeline examination of what is happening in the application. In the case of accelerators, this includes data movement to and from accelerators.

With the initial list of target routines in hand, you can start adding directives to your code. Generally, adding only one directive at a time is recommended. By incrementally adding directives, you can understand the effect each makes on run time. You add a directive, rebuild the code, run the code, test it to make sure the answers are correct, and then look at the effect on performance.

While adding loops to your code, if run time goes up, don't despair. This can happen because of bad data movement to and from the CPU and the accelerator. (I will cover this topic in an upcoming article.) You need to focus on the accuracy of the output. If the output is not correct, then you might need to change the directive, change the code, or even drop the parallel loop directive from that portion of the code.

Summary

OpenACC directives allow you to take serial code and "port" it to multicore CPUs or accelerators such as GPUs. These directives appear to be comments to the compiler, unless the compiler understands the directives, which allows you to use one version of code – reducing the chance of errors and keeping code size down – and build it with your usual compiler. If a compiler understands OpenACC, then simply adding specific flags to your compile line will allow you to build and run with multiple CPU cores or accelerators.

Performance improvements are achieved by locating regions that can be parallelized in your application. A classic approach is to find loops that can be parallelized. This article tackled the parallel loop OpenACC directive for both Fortran and C. A best practice is to combine parallel and loop in one directive (i.e., !$acc parallel loop) for a loop nest you want to parallelize. Some "clauses" can be used to add more specificity to the directive to help the compiler better generate accelerator code. These clauses aren't covered in this article but can be found online.

When using loop parallelization, it is best practice to focus on the routines that use the most run time. In this fashion, you'll make the biggest dent in run time with the fewest directives. A simple profile of the application can provide you with the stack rank of the most time consuming routines.

As you add directives to your application, be sure to check the output. It should match the application output when the application isn't built with directives. This is absolutely vital to improving your application. Running fast but not getting the correct output is worthless. Running on processors other than CPUs can produce slightly different output. Therefore you don't want to do a bit-for-bit comparison; rather, you want to compare the output for significant differences. Defining what is "significant" is up to the user and the application, but should not be taken lightly.

OpenACC has a number of directives other than parallel loop to help you port applications. However, these directives allow you to start attacking "hot spots" in your code immediately to improve performance. Learning just two OpenACC clauses in exchange for improving parallelization through loops isn't a bad trade.

In the next OpenACC article, I discuss data usage, focusing on how you can consider it in combination with parallel loops to get even better performance.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus
Subscribe to our ADMIN Newsletters
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs



Support Our Work

ADMIN content is made possible with support from readers like you. Please consider contributing when you've found an article to be beneficial.

Learn More”>
	</a>

<hr>		    
			</div>
		    		</div>

		<div class=