Pymp – OpenMP-like Python Programming

Applying OpenMP techniques to Python code.

Ever since Python was created, users have been looking for ways to achieve parallel computation with threads, which the Python global interpreter lock (GIL) prevents. One common approach to getting around the GIL is to run computationally intensive code outside of Python with tools such as Cython and ctypes. You can even use F2PY with compiled Fortran functions.

All of the previously mentioned tools “bypass” Python and rely on a compiled language to provide threaded multiprocessing elements with an interface to Python. What is really needed is either a way to perform threaded processing or a form of multiprocessing in Python itself. A very interesting tool for this purpose is Pymp, a Python-based method of providing OpenMP-like functionality.

OpenMP

OpenMP employs a few principles in its programming model. The first is that everything takes place in threads. The second is the fork-join model, which comprises parallel regions in which one or more threads can be used (Figure 1).

Figure 1: Illustration of the fork-join model for OpenMP.

Only a single thread (the master thread) exists before the parallel region of the OpenMP program. When the parallel region is encountered, the master thread creates a team of parallel threads. The code in this parallel region is executed in parallel among the various team threads.

When the threads complete their code in the parallel region, they synchronize and terminate, leaving only the master thread. Inside the parallel region, threads typically share data, and all of the threads can access this shared data at the same time.

The process of forking threads in a parallel region, joining the data back to the master thread, and terminating the other threads can be done many times in a single program, although you don’t want to do it too often because of the need to create and destroy the threads.

Pymp

Because the goal of Pymp is to bring OpenMP-like functionality to Python, Pymp and OpenMP naturally share some concepts: a single master thread forks into multiple threads, the threads share data, and then they synchronize (join) and are destroyed.

As with OpenMP applications, when Pymp Python code hits a parallel region, processes – termed child processes – are forked and are in a state that is nearly the same as the “master process.” Note that these are forked processes and not threads, as would be the case with OpenMP applications. As for the shared memory, according to the Pymp website, “… the memory is not copied, but referenced. Only when a process writes into a part of the memory [does] it get its own copy of the corresponding memory region. This keeps the processing overhead low (but of course not as low as for OpenMP threads).”

As the parallel region ends (the “join” phase), all of the child processes exit so that only the master process continues. Any data structures from the child processes are synchronized using either shared memory or a manager process and the pickle protocol via the multiprocessing module. This module has an API similar to the threading module and supports spawning processes.

As with OpenMP, Pymp numbers the child processes with the thread_num variable. The master process has thread_num 0.
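For example, each forked process can query its own number inside the parallel region. The following minimal sketch (it assumes the num_threads attribute documented on the Pymp site, alongside the thread_num variable just mentioned) has every process report its identity:

import pymp

# Fork three child processes in addition to the master (four in total);
# each process reports its own number. The master is always thread_num 0.
with pymp.Parallel(4) as p:
    p.print('Hello from process {} of {}'.format(p.thread_num, p.num_threads))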

With OpenMP, you define a parallel region; in Fortran and C, the regions are defined by the directives in Listing 1. Pymp has no equivalent directives; instead, the parallel region is an ordinary Python with block, and the Pymp website recommends combining it with a pymp.range or pymp.xrange statement, or even an if-else statement, to achieve the same expected behavior.

Listing 1: Defining a Parallel Region

Fortran:

!$omp parallel
...
!$omp end parallel

C:

#pragma omp parallel
{
...
}

From the website, example code might look like:

with pymp.Parallel(4) as p:
  for sec_idx in p.xrange(4):
    if sec_idx == 0:
      p.print('Section 0')
    elif sec_idx == 1:
      p.print('Section 1')
    ...

The first statement in the code outline defines the parallel construct.

As with OpenMP code, you can control various aspects of Pymp code with environment variables: either the familiar OpenMP variables that begin with OMP or the Pymp-specific versions that begin with PYMP. The mapping is pretty straightforward:

  • PYMP_NESTED/OMP_NESTED
  • PYMP_THREAD_LIMIT/OMP_THREAD_LIMIT
  • PYMP_NUM_THREADS/OMP_NUM_THREADS

The first variable is a binary flag, TRUE or FALSE, that enables or disables nested parallel regions. The second sets a limit on the total number of threads (forked processes). The third is a comma-separated list of the number of threads per nesting level; if only one value is specified, it is used for all levels.
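For example, to allow nested regions and use four processes at the outer level and two at each inner level, you might export the variables in your shell before launching a script (the script name here is just a placeholder):

$ export PYMP_NESTED=TRUE
$ export PYMP_NUM_THREADS=4,2
$ python my_pymp_script.py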

Other aspects to Pymp are explained on the website, including:

  • Schedules
  • Variable scopes
  • Nested Loops
  • Exceptions
  • Conditional parallelism
  • Reductions
  • Iterables

This article is too short to cover these topics, but if you are interested, the GitHub site briefly explains them, and you can create some simple examples for further exploration.
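As one example, a reduction can be written by hand with a shared array and the lock provided by the parallel context. The following is a minimal sketch based on the project's documentation; it assumes the p.lock attribute described there:

import pymp

# Each process adds its contributions to a single shared counter.
# The lock serializes the updates so that no increments are lost.
total = pymp.shared.array((1,), dtype='uint32')
with pymp.Parallel(4) as p:
    for index in p.range(0, 100):
        with p.lock:
            total[0] += 1

# Only the master process continues past the parallel region.
print(total[0])  # expected value: 100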

Installing Pymp

Although I use Anaconda, I also use Pip when needed. When I examined both Anaconda and Pip for Pymp, the latest version they both had was 0.01, although the last version released was 0.4.2 on September 7, 2018; therefore, I installed Pymp according to the instructions on the website.

After downloading and exploding the .tar.gz or .zip source file and moving to the directory created, I ran the command,

$ python setup.py develop

which installs Pymp in development mode (i.e., it runs directly from the source directory). Fortunately, the Pymp codebase is fairly small, so this step completes very quickly.

Note that the website proposes using the latest master branch from the Git repo if you want the absolute latest version of Pymp.
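A quick sanity check after installation is simply to import the module from the command line:

$ python -c "import pymp; print('Pymp imported successfully')"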

Simple Introductory Code

To illustrate how to code with Pymp, the sample code from the website starts from the basic Python code in Listing 2. To keep things simple, it is serial code that fills a single array. Listing 3 is the Pymp version of the same code.

Listing 2: Python Code

from __future__ import print_function
 
import numpy as np
 
ex_array = np.zeros((100,), dtype='uint8')
for index in range(0, 100):
    ex_array[index] = 1
    print('Yay! {} done!'.format(index))

Listing 3: Pymp Code

from __future__ import print_function
 
import pymp
 
ex_array = pymp.shared.array((100,), dtype='uint8')
with pymp.Parallel(4) as p:
    for index in p.range(0, 100):
        ex_array[index] = 1
        # The parallel print function takes care of asynchronous output.
        p.print('Yay! {} done!'.format(index))

The first change to the serial code is creating a shared array with a pymp method. The next change is adding the statement that creates the parallel region with four processes (with pymp.Parallel(4) as p). Remember that these are forked processes and not threads.

The final action is to change the range function to p.range(0, 100). According to the Pymp website, this is the same as using the static schedule.
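If the cost per iteration varies, a dynamically scheduled loop can balance the work better. The sketch below assumes, as described on the Pymp site, that p.xrange is the dynamic counterpart of p.range:

import pymp

ex_array = pymp.shared.array((100,), dtype='uint8')
with pymp.Parallel(4) as p:
    # p.xrange hands out iterations dynamically rather than in fixed
    # blocks, which can help when iterations have uneven cost.
    for index in p.xrange(0, 100):
        ex_array[index] = 1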

The approach illustrated in these code samples bypasses the GIL in favor of the operating system’s fork method. From the GitHub site, “Due to the copy-on-write strategy, this causes only a minimal overhead and results in the expected semantics.” Note that using the system fork operation excludes Windows because it lacks a fork mechanism.

Laplace Solver Example

The next example, the common Laplace solver, is a little more detailed. The code is definitely not the most efficient – it uses explicit loops – but I hope it illustrates how to use Pymp. For the curious, timings are included in the code. Listing 4 is the Python version, and the Pymp version of the code is shown in Listing 5. Changed lines are marked with arrows (--> <--).

Listing 4: Python Laplace Solver

import numpy
from time import perf_counter
 
nx = 1201
ny = 1201
 
# Solution and previous solution arrays
sol = numpy.zeros((nx,ny))
soln = sol.copy()
  
for j in range(0,ny-1):
    sol[0,j] = 10.0
    sol[nx-1,j] = 1.0
# end for
  
for i in range(0,nx-1):
    sol[i,0] = 0.0
    sol[i,ny-1] = 0.0
# end for
  
# Iterate
start_time = perf_counter()
for kloop in range(1,100):
    soln = sol.copy()
 
    for i in range(1,nx-1):
        for j in range (1,ny-1):
            sol[i,j] = 0.25 * (soln[i,j-1] + soln[i,j+1] + soln[i-1,j] + soln[i+1,j])
        # end j for loop
    # end i for loop
#end for
end_time = perf_counter()
 
print(' ')
print('Elapsed wall clock time = %g seconds.' % (end_time-start_time) )
print(' ')

Listing 5: Pymp Laplace Solver

-->  import pymp  <--
from time import perf_counter
 
nx = 1201
ny = 1201
 
# Solution and previous solution arrays
-->  sol = pymp.shared.array((nx,ny))  <--
-->  soln = pymp.shared.array((nx,ny))  <--
 
for j in range(0,ny-1):
    sol[0,j] = 10.0
    sol[nx-1,j] = 1.0
# end for
 
for i in range(0,nx-1):
    sol[i,0] = 0.0
    sol[i,ny-1] = 0.0
# end for
 
# Iterate
start_time = perf_counter()
-->  with pymp.Parallel(6) as p:  <--
    for kloop in range(1,100):
        soln = sol.copy()
 
        for i in p.range(1,nx-1):
            for j in p.range (1,ny-1):
                sol[i,j] = 0.25 * (soln[i,j-1] + soln[i,j+1] + soln[i-1,j] + soln[i+1,j])
            # end j for loop
        # end i for loop
    # end kloop for loop
# end with
end_time = perf_counter()
 
print(' ')
print('Elapsed wall clock time = %g seconds.' % (end_time-start_time) )
print(' ')

To show that Pymp is actually doing what it is supposed to do, Table 1 shows the timings for various numbers of cores. Notice that the total time decreases as the number of cores increases, as expected.

Table 1: Pymp Timings

No. of Cores     Total Time (sec)
Base (serial)    165.0
1                 94.0
2                 42.0
4                 10.9
6                  5.0

Summary

Ever since Python was created, people have been trying to achieve multithreaded computation with it. Several tools have been created to perform computations outside of Python and integrate the results back into Python code.

Over time, parallel programming approaches have become standards. OpenMP is one of the original standards and is very popular in C/C++ and Fortran programming, so a large number of developers know and use it in application development. However, OpenMP is used with compiled, not interpreted, languages.

Fortunately, the innovative Python Pymp tool was created. It is an OpenMP-like Python module that uses the fork mechanism of the operating system instead of threads to achieve parallelism. As illustrated in the examples in this article, it’s not too difficult to port some applications to Pymp.
