Lead Image © orson, 123RF.com

Using loop directives to improve performance

Parallelizing Code

Article from ADMIN 49/2019

By Jeff Layton

OpenACC is a great tool for parallelizing applications for a variety of processors. In this article, I look at one of the most powerful directives, parallel loop.

In the last half of 2018, I wrote about critical high-performance computing (HPC) admin tools [1]. Often, HPC admins become programming consultants by helping researchers get started with applications, debug the applications, and improve performance. In addition to administering the system, then, they have to know good programming techniques and what tools to use.

MPI+X

The world is moving toward exascale computing – at least 1018 floating-point operations per second (FLOPS) – at a rapid pace. Even though most systems aren't exascale, quite a few are at least petascale (>1015 FLOPS) and use a large number of nodes. Programming techniques are evolving to accommodate petascale systems while getting ready for exascale. Meanwhile, a key programming technique called MPI+X refers to using the Message Passing Interface (MPI) in an application for data communication between nodes while using something else (the X ) for application coding within the node.

The X can refer to any of several tools or languages, including the use of MPI across all nodes (i.e., MPI+MPI), which has been a prevalent programming technique for quite a while. Classically, each core assigned to an application is assigned an MPI rank and communicates over whatever network exists between the nodes. To adapt to larger and larger systems, data communication has adapted to use multiple levels of communication. MPI ranks within the same node can communicate directly without a network interface card (NIC). Ranks that are not on the same physical node communicate through NIC. Networking techniques can take advantage of specific topologies to reduce latency, improve bandwidth, and improve scalability.