High-Performance Python – Compiled Code and C Interface

Although Python is a popular language, in the high-performance world, it is not known for being fast. A number of tactics have been employed to make Python faster. We look at three: Numba, Cython, and ctypes.

Python is one of the fastest-growing languages for computing, the number one language for deep learning, and in the top three for machine learning. Literally thousands of Python add-on modules can be used for everything from plotting data to communicating with embedded hardware.

One of the common complaints about Python is that it is too slow, partly because it is interpreted and partly because of the Global Interpreter Lock (GIL), a mutex that prevents multiple threads from executing Python bytecodes at once. People started coming up with tools to improve the performance of Python. These tools usually take the form of compiling Python or interfacing compiled languages with Python.

In the next set of articles, I cover some tools that you can use to improve the performance of Python. The articles are generally presented in the following manner:

  • Compilation (JIT and static) and interface with C
  • Interfacing Python and Fortran
  • GPUs and Python
  • Dask, Networking, and Python module combinations

Not all of the tools in each category will be covered, but I’ll present some of the major tools.

In this and future articles in the series, I use the Anaconda distribution of Python. It has some of the more current tools available, but it doesn’t have everything, so some tools that aren’t available in Anaconda won’t be presented.

In this article, I investigate compiling Python code with a just-in-time (JIT) compiler, a tool for compiling Python code into compiled C code that can be used as a module within Python, and a tool to compile existing C code into Python modules. The goal of all three of these tools is to make Python code faster.

Compiling Python with Numba

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code at run time; hence, the “JIT” designation. Numba uses the LLVM compiler library for ultimately compiling the code. You can also write CUDA kernels with Numba. Numba has support for automatic parallelization of loops, generation of GPU-accelerated code (both Nvidia and AMD), and the creation of universal functions (ufuncs) and C callbacks. The compiler is under continual development, with the addition of more capability, more performance, and more NumPy functions.

A ufunc operates on an ndarray, element by element. You can think of ufuncs as being a “vectorized” wrapper for a function that takes a fixed number of inputs and produces a fixed number of outputs. Ufuncs are very important to Numba.
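For instance, numpy.add is a ufunc: given two arrays, it adds them element by element, and a scalar argument is broadcast across the array:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])

# numpy.add is a ufunc: it operates on the arrays element by element
print(np.add(x, y))   # [11 22 33]

# a scalar argument is broadcast across the whole array
print(np.add(x, 5))   # [6 7 8]
```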

Numba takes the Python function, optimizes it, and then converts it into Numba’s intermediate representation. Type inference follows, and the code is converted into LLVM-interpretable code. The resulting code is then fed to LLVM’s JIT compiler to output machine code.

You see the most benefit with Numba on functions that have a great deal of arithmetic intensity (lots of computations). An example would be routines that have loops. Although you can compile Python functions that don’t have loops, they might run slower than the original Python code.

Decorators

Decorators are a really cool part of Python that allow you to call higher order functions. A decorator function takes another function and extends it without explicitly modifying it. In essence, it is a wrapper to existing functions. An in-depth explanation is beyond the scope of this article focused on high-performance Python, but you can read about decorators online.
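As a minimal illustration (this timing decorator is my own example, not part of Numba), a decorator wraps a function and extends it without modifying its body:

```python
import functools
import time

def timed(func):
    """Decorator: wrap func so that each call reports its elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.6f} s")
        return result
    return wrapper

@timed                      # equivalent to: count = timed(count)
def count(n):
    total = 0
    for i in range(n):
        total += i
    return total

count(100_000)              # prints the elapsed time, then returns the sum
```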

Numba uses decorators to extend the functions to be compiled by the JIT compiler and has a number of decorators that can be used as part of the JIT compiler, depending on how you import Numba:

  • @jit (compile Python code, with options such as automatic parallelization)
  • @njit (compile in nopython mode; shorthand for @jit(nopython=True))
  • @generated_jit (flexible specializations)
  • @jitclass (compile Python classes)
  • @cfunc (create C callbacks)
  • @stencil (specify a stencil kernel)
  • @vectorize (turn a function on scalar arguments into a NumPy ufunc)

Even Nvidia GPUs have a decorator,

  • @cuda.jit

as do AMD ROCm GPUs:

  • @roc.jit

Before I show examples, remember that when using a decorator, the code has to be something Numba can compile, and it should have relatively high arithmetic intensity (e.g., loops). Numba can compile lots of Python code, but if it runs slower than the native Python code, why use it?

When you first use a decorator such as @jit, the “decorated” code is compiled; therefore, if you time functions, the first pass through the code will include the compilation time. Fortunately, Numba caches the functions as machine code for subsequent usage. If you use the function a second time, it will not include the compilation time (unless you’ve changed the code). This also assumes that you are using the same argument types.

You can pass arguments to the @jit decorator. Numba has two compilation modes: nopython and object. In general, nopython mode produces much faster code, but it has limitations that can force Numba to fall back to the slower object mode. To prevent this silent fallback, pass the option nopython=True to the JIT compiler, which raises an error instead if nopython compilation fails.

Another argument that can be passed is nogil. The GIL prevents threads from colliding within Python. If you are sure that your code is consistent, has no race conditions, and does not require synchronization, you can use nogil=True, and Numba will release the GIL when entering a compiled function. Note that you can't use nogil in object mode.

A third argument that can be used with the @jit decorator is parallel. If you pass the argument parallel=True, Numba will run a transformation pass that attempts to parallelize portions of the code automatically, along with other optimizations. In particular, Numba supports explicit parallel loops. However, parallel has to be used with nopython=True. Currently, I believe the parallel option only works with CPUs.

A really cool feature of the code transformation pass is that when you use the parallel=True option, you can use Numba's prange function instead of range, which tells Numba that the loop can be parallelized. Just be sure that the loop does not have cross-iteration dependencies, apart from supported reductions (e.g., accumulating into a scalar with +=); loops that Numba cannot parallelize are run single threaded.

As mentioned previously, when you execute decorated Python code that Numba understands, it is compiled to machine code with LLVM. This compilation happens the first time you execute the code. The example code in Listing 1 is borrowed from a presentation by Matthew Rocklin. It was run in a Jupyter notebook to get the timings.

Listing 1: Python First Run Without Numba

import numpy
 
def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total
 
x = numpy.arange(10_000_000);
%time sum(x)
 
CPU times: user 1.63 s, sys: 0 ns, total: 1.63 s
Wall time: 1.63 s
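The %time magic is specific to IPython/Jupyter. Outside a notebook, the standard library's timeit module gives comparable numbers (a small sketch; the exact times are machine dependent):

```python
import timeit
import numpy

def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total

x = numpy.arange(10_000_000)
# number=1 runs the statement once and returns the elapsed seconds
print(timeit.timeit(lambda: sum(x), number=1), "seconds")
```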

Next, add Numba into the code (Listing 2) so the @jit decorator can be used. (Don't forget to import the Numba module.)

Listing 2: Python First Run With Numba

import numba
import numpy
 
@numba.jit
def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total
 
x = numpy.arange(10_000_000);
%time sum(x)
 
CPU times: user 145 ms, sys: 4.02 ms, total: 149 ms
Wall time: 149 ms

A speedup is nice to see, but believe it or not, quite a bit of the time is spent compiling the function. Recall that the first pass through the code compiles it. Subsequent passes do not:

CPU times: user 72.3 ms, sys: 8 µs, total: 72.3 ms
Wall time: 72 ms

Notice that the second run takes about half the time of the first, so roughly 70 ms went to compiling the code and roughly 70 ms to running it the first time around. The second time, the code wasn't compiled, so it only took a little more than 70 ms to run.

Cython

Cython is an optimizing static compiler for Python (Python2/Python3) and an extended programming language based on Pyrex. (Note: You had better be moving away from Python 2 and on to Python 3 pretty quickly because, as the Python wiki says, “Python 2.x is legacy, Python 3.x is the present and future of the language.”) The Pyrex language is used to create Python modules. It is really a superset of Python that also supports calling C functions and declaring C types on variables and class attributes. As a result, with some work, very efficient C code can be generated.

Unlike Numba, which is a JIT compiler, Cython translates the Python code to C and compiles it into an appropriate form to be used in Python. In general, the C code compiles with almost any C/C++ compiler, which makes Cython a good tool for compiling Python code that is frequently used but doesn’t change too much.

Cython Examples

Cython can accept almost any valid Python source file to produce C code. Compiling the C code is fairly simple. The first step in using Cython is the easiest: Select the code you want and put it into a separate file. You can have more than one function per file if you like.

The second step is to create the setup.py file, which is like a makefile for Python. It defines what Python file you want to compile into a shareable library and is where you can put options (e.g., compile options) you want to use. After compiling, be sure to test the code.

Here, I use two examples from a Cython tutorial. The first is a simple Hello World example, and the second is a summation example that uses a loop.

Hello World

The Python code to be compiled in the helloworld.pyx file is

print("Hello World")

which is just about the simplest one-line Python script you can have.

As previously mentioned, you need to create a setup.py file that is really a Python makefile:

from distutils.core import setup
from Cython.Build import cythonize
 
setup(
    ext_modules = cythonize("helloworld.pyx")
)

The first two lines are fairly standard for a Python setup.py file. After that, the setup command builds the binary (shared object). In this case, the command is to cythonize the helloworld.pyx file. To make life easier, be sure to put this file in the same directory as the code.

The system I used had Ubuntu 18.04 (with updates) and the Anaconda Python distribution. To build the binary, enter

$ python3 setup.py build_ext --inplace

The output is shown in Listing 3.

Listing 3: Binary Build

$ python3 setup.py build_ext --inplace
Compiling helloworld.pyx because it changed.
[1/1] Cythonizing helloworld.pyx
/home/laytonjb/anaconda3/lib/python3.7/site-packages/Cython/Compiler/Main.py:367: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /home/laytonjb/HPC-PYTHON-1/helloworld.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running build_ext
building 'helloworld' extension
creating build
creating build/temp.linux-x86_64-3.7
gcc -pthread -B /home/laytonjb/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/laytonjb/anaconda3/include/python3.7m -c helloworld.c -o build/temp.linux-x86_64-3.7/helloworld.o
gcc -pthread -shared -B /home/laytonjb/anaconda3/compiler_compat -L/home/laytonjb/anaconda3/lib -Wl,-rpath=/home/laytonjb/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/helloworld.o -o /home/laytonjb/HPC-PYTHON-1/helloworld.cpython-37m-x86_64-linux-gnu.so

Note that the command line uses setup.py as the “configuration” for building the binary (shared object). In the output, you will see paths that correspond to the system I used. Don’t worry about this because setup.py takes care of the paths.

Now I can test the compiled Cython shared object:

>>> import helloworld
Hello World

It worked! These are the basic steps for creating a compiled binary (shared object) of Python code.

Summing

To begin, I’ll take some simple Python code from the Numba example and compute the sum of a one-dimensional list. Although I’m sure better code is out there for computing a sum, this example will teach you how to do a more mathematical example.

For this example, a simple function in the sum.pyx file computes the sum:

def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total

The code is compiled the same way as the Hello World code, with a change to the Python function in setup.py to cythonize sum.pyx. The code in Listing 4 tests the module in a Jupyter notebook.

Listing 4: Summation Test

import sum
import numpy
x = numpy.arange(10_000_000);
%time sum.sum(x)
 
CPU times: user 1.37 s, sys: 0 ns, total: 1.37 s
Wall time: 1.37 s

Notice that sum is the module and sum.sum is the function within the module, which means you can put more than one function in your Python code. Also notice that the time for running the code is about the same as for the pure Python version. Although you can optimize Cython code by, for example, employing OpenMP, I won't discuss that here.
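One optimization is worth mentioning, though: much of Cython's speedup comes from declaring C types, which lets the loop compile down to plain C instead of Python object operations. A sketch of what a typed version of the summation might look like (the file name and function name are my own; the cdef declarations and typed memoryview are standard Cython idioms, and the file is compiled with a setup.py exactly as before):

```cython
# sum_typed.pyx -- typed version of the summation
def sum_typed(long[:] x):       # typed memoryview over a NumPy integer array
    cdef long total = 0         # C integer, no Python object overhead
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        total += x[i]
    return total
```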

Ctypes

Cython takes Python code, converts it to C, and compiles it for you, but what if you have existing C code that you want to use in Python like a Python module? This is where ctypes can help.

The ctypes foreign function library provides C-compatible data types and lets you call functions in dynamic link libraries (DLLs) or shared libraries from within Python. In essence, it “wraps” these libraries so they can be called from Python. You can find ctypes with virtually any Python distribution.
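To see the mechanics before wrapping your own code, you can call a library that is already on the system. A small sketch (on Linux; it assumes the C math library can be located with ctypes.util.find_library):

```python
import ctypes
import ctypes.util

# load the C math library and describe sqrt's C signature to ctypes:
# double sqrt(double)
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.argtypes = (ctypes.c_double,)
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))
```

Setting argtypes and restype tells ctypes how to convert between Python objects and C types; without them, ctypes assumes int, which silently corrupts floating-point results.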

To use ctypes, you typically start with your C/C++ code and build the shareable object as usual. However, be sure to use the position-independent code (PIC) flag and the shared flag (you’ll be building a library). For example, with gcc, you use the -fPIC and -shared options:

$ gcc -fPIC -shared -o libsource.so source.c

It is up to you to compile the C code and create a library using any method you like, as long as you use the -fPIC and -shared options.

Ctypes Example – sum

In the previous example, most of the work is done in the summation, so now I’ll rewrite that routine in C to get better performance. According to an online tutorial, an example in C for computing the sum and building it into a library is shown in Listing 5.

Listing 5: C Summation

int sum_function(int num_numbers, int *numbers) {
    int i;
    int sum = 0;
    for (i = 0; i < num_numbers; i++) {
        sum += numbers[i];
    }
    return sum;
}

The function is named sum_function, and the file is sum.c. This code can be compiled into a shared object (library) with gcc:

$ gcc -fPIC -shared -o libsum.so sum.c

The compiler creates the shared object libsum.so, a library.

To use the library in Python, a few specific ctypes functions and variables are needed within Python. Because it can make the Python code a bit complex, I write a "wrapper function" for the library in a Python file, sum.py (Listing 6).

Listing 6: Wrapper Function

import ctypes
 
_sum = ctypes.CDLL('libsum.so')
_sum.sum_function.argtypes = (ctypes.c_int, ctypes.POINTER(ctypes.c_int))
 
def sum_function(numbers):
    global _sum
    num_numbers = len(numbers)
    array_type = ctypes.c_int * num_numbers
    result = _sum.sum_function(ctypes.c_int(num_numbers), array_type(*numbers))
    return int(result)

Notice that the specific function sum_function is defined. If you have a library with more than one function, you will have to create the interface for each function in this file.

Now to test the C function in Python:

import sum
import numpy
 
x = numpy.arange(10000000)
%time sum.sum_function(x)
 
CPU times: user 2.15 s, sys: 68.4 ms, total: 2.22 s
Wall time: 2.22 s

The eagle has landed! It works! However, you might notice that it is slower than even the pure Python code. Most likely, the overhead of converting the 10 million-element NumPy array into a ctypes array swamps the time spent computing the sum in C. Do not let this deter you. It's always worth trying ctypes if you want or need more performance.

Summary

Python is amazingly popular right now, with thousands of modules that can be used to extend its capability. However, in the high-performance world, Python is not known for being fast. A number of people have written tools and extensions to make Python faster.

In this article, I presented Numba, which compiles Python code with a just-in-time compiler invoked with a decorator. A simple summation example was found to be much faster than the original Python code. More on Numba will be presented in future articles.

Cython was also discussed briefly. It is a tool for “translating” Python code into C code. A simple Python “makefile” translates and compiles the code for you. It’s actually pretty simple to use and allows you to create a large Python module with lots of functions.

Finally, I presented ctypes, which uses an approach opposite that of Numba and Cython by taking existing C code that is compiled into a library, coupled with some ctypes functions and variables, to create a module that can be used in Python. The ctypes library has a great deal of flexibility, so you can incorporate code written in C into Python.