Although Python is a popular language, in the high-performance world, it is not known for being fast. A number of tactics have been employed to make Python faster. We look at three: Numba, Cython, and ctypes.

High-Performance Python – Compiled Code and C Interface

Python is one of the fastest-growing languages for computing, the number one language for deep learning, and in the top three for machine learning. Thousands of Python add-on modules can be used for everything from plotting data to communicating with embedded hardware.

One of the common complaints about Python is that it is too slow, partly because it is interpreted and partly because of the Global Interpreter Lock (GIL), a mutex that prevents multiple threads from executing Python bytecodes at once. People started coming up with tools to improve the performance of Python. These tools usually take the form of compiling Python or interfacing compiled languages with Python.

In the next set of articles, I cover some tools that you can use to improve the performance of Python. The articles are generally organized as follows:

  • Compilation (JIT and static) and interface with C
  • Interfacing Python and Fortran
  • GPUs and Python
  • Dask, Networking, and Python module combinations

Not all of the tools in each category will be covered, but I’ll present some of the major tools.

In this and future articles in the series, I use the Anaconda distribution of Python. It has some of the more current tools available, but it doesn’t have everything, so some tools that aren’t available in Anaconda won’t be presented.

In this article, I investigate compiling Python code with a just-in-time (JIT) compiler, a tool for compiling Python code into compiled C code that can be used as a module within Python, and a tool to compile existing C code into Python modules. The goal of all three of these tools is to make Python code faster.

Compiling Python with Numba

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code at run time; hence, the “JIT” designation. Under the hood, Numba uses the LLVM compiler library to generate the machine code. You can also write CUDA kernels with Numba. Numba has support for automatic parallelization of loops, generation of GPU-accelerated code (for both Nvidia and AMD GPUs), and the creation of universal functions (ufuncs) and C callbacks. The compiler is under continual development, gaining more capability, more performance, and coverage of more NumPy functions.

A ufunc operates on an ndarray, element by element. You can think of a ufunc as a “vectorized” wrapper for a function that takes a fixed number of inputs and produces a fixed number of outputs. Ufuncs are very important to Numba.
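
For example, numpy.add is a ufunc. A minimal illustration (the array values are arbitrary):

import numpy

a = numpy.array([1, 2, 3])
b = numpy.array([10, 20, 30])

# A ufunc operates element by element and broadcasts scalars.
print(numpy.add(a, b))   # [11 22 33]
print(numpy.add(a, 5))   # [ 6  7  8]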

Numba takes the Python function, optimizes it, and then converts it into Numba’s intermediate representation. Type inference follows, and the code is lowered to LLVM intermediate representation (IR). The resulting IR is then fed to LLVM’s JIT compiler to produce machine code.
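
If you are curious what type inference concluded, compiled Numba functions have an inspect_types() method that prints the source annotated with the inferred types. A quick sketch (the add function here is just a toy example):

import numba

@numba.jit(nopython=True)
def add(a, b):
    return a + b

add(1.0, 2.0)        # the first call triggers compilation for (float64, float64)
add.inspect_types()  # print the source annotated with inferred Numba types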

You see the most benefit with Numba on functions that have a great deal of arithmetic intensity (lots of computations). An example would be routines that have loops. Although you can compile Python functions that don’t have loops, they might run slower than the original Python code.

Decorators

Decorators are a really cool part of Python built on higher order functions: a decorator takes another function and extends it without explicitly modifying it. In essence, it is a wrapper around an existing function. An in-depth explanation is beyond the scope of this article focused on high-performance Python, but you can read about decorators online.
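
As a quick taste (a toy example, nothing to do with Numba yet), a decorator takes a function and returns a wrapped version of it:

import functools

def shout(func):
    # Wrap func so that its result is uppercased.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout                  # equivalent to: greet = shout(greet)
def greet(name):
    return f"hello, {name}"

print(greet("python"))  # prints: HELLO, PYTHON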

Numba uses decorators to mark the functions to be compiled by the JIT compiler and has a number of decorators that can be used, depending on how you import numba:

  • @jit (compile Python code)
  • @njit (shorthand for @jit(nopython=True), compiling in nopython mode)
  • @generated_jit (flexible specializations)
  • @jitclass (compile Python classes)
  • @cfunc (create C callbacks)
  • @stencil (specify a stencil kernel)
  • @vectorize (allow scalar arguments to be used as NumPy ufuncs)

Even Nvidia GPUs have a decorator,

  • @cuda.jit

as do AMD ROCm GPUs:

  • @roc.jit

Before I show examples, remember that when using a decorator, the code has to be something Numba can compile, and it should have relatively high arithmetic intensity (e.g., loops). Numba can compile lots of Python code, but if it runs slower than the native Python code, why use it?

When you first use a decorator such as @jit, the “decorated” code is compiled; therefore, if you time functions, the first pass through the code will include the compilation time. Fortunately, Numba caches the functions as machine code for subsequent usage. If you use the function a second time, it will not include the compilation time (unless you’ve changed the code). This also assumes that you are using the same argument types.
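
If you would rather pay the compilation cost up front instead of on the first call, you can hand @jit an explicit type signature, which makes Numba compile eagerly at decoration time. A minimal sketch (the signature string "f8(f8[:])" means a function that returns a float64 and takes a one-dimensional float64 array):

import numba

# Explicit signature: Numba compiles now, not at the first call.
@numba.jit("f8(f8[:])", nopython=True)
def total(x):
    t = 0.0
    for i in range(x.shape[0]):
        t += x[i]
    return t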

You can pass arguments to the @jit decorator. Numba has two compilation modes: nopython and object. In general, nopython mode produces much faster code, but its limitations can force Numba to fall back to the slower object mode. To keep that fallback from happening silently, pass the option nopython=True to the JIT compiler; Numba will then raise an error if it can’t compile the function in nopython mode.
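
In code, that looks like the sketch below; the @numba.njit decorator mentioned earlier is shorthand for the same thing:

import numba

@numba.jit(nopython=True)  # raise an error instead of falling back to object mode
def dot(a, b):
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s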

Another argument that can be passed is nogil. The GIL prevents threads from colliding within Python. If you are sure that your code is consistent, has no race conditions, and does not require synchronization, you can use nogil=True, and Numba will release the GIL when entering a compiled function. Note that you can’t use nogil in object mode.
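
As a hedged sketch of how you might exploit this, the compiled function below releases the GIL, so several Python threads can run it at the same time (the four-way split of the array is arbitrary):

import numpy
import numba
from concurrent.futures import ThreadPoolExecutor

@numba.jit(nopython=True, nogil=True)  # the GIL is released inside compiled code
def total(x):
    t = 0.0
    for i in range(x.shape[0]):
        t += x[i]
    return t

x = numpy.arange(10_000_000, dtype=numpy.float64)
chunks = numpy.array_split(x, 4)       # arbitrary four-way split of the work
with ThreadPoolExecutor(max_workers=4) as pool:
    print(sum(pool.map(total, chunks)))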

A third argument that can be used with the @jit decorator is parallel. If you pass parallel=True, Numba runs a transformation pass that attempts to parallelize portions of the code automatically, along with other optimizations. In particular, Numba supports explicit parallel loops. However, parallel=True has to be used with nopython=True. Currently, I believe the parallel option only works with CPUs.

A really cool feature of this transformation pass is that when you use the parallel=True option, you can use Numba’s prange function instead of range, which tells Numba that the loop can be parallelized. Just be sure that the loop does not have cross-iteration dependencies, except for supported reductions (e.g., summing into a single variable).
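
A short sketch of prange in action:

import numpy
import numba

@numba.jit(nopython=True, parallel=True)
def par_sum(x):
    total = 0.0
    for i in numba.prange(x.shape[0]):  # prange marks the loop as parallelizable
        total += x[i]                   # accumulating into total is a supported reduction
    return total

x = numpy.arange(10_000_000, dtype=numpy.float64)
print(par_sum(x))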

As mentioned previously, when you execute decorated Python code that Numba understands, it is compiled to machine code with LLVM. This compilation happens the first time you execute the code. The example code in Listing 1 is borrowed from a presentation by Matthew Rocklin. It was run in a Jupyter notebook to get the timings.

Listing 1: Python First Run Without Numba

import numpy
 
def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total
 
x = numpy.arange(10_000_000)
%time sum(x)
 
CPU times: user 1.63 s, sys: 0 ns, total: 1.63 s
Wall time: 1.63 s

Next, add Numba into the code (Listing 2) so the @jit decorator can be used. (Don’t forget to import the numba module.)

Listing 2: Python First Run With Numba

import numba
import numpy
 
@numba.jit
def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total
 
x = numpy.arange(10_000_000)
%time sum(x)
 
CPU times: user 145 ms, sys: 4.02 ms, total: 149 ms
Wall time: 149 ms

A speedup is nice to see, but believe it or not, quite a bit of the time is spent compiling the function. Recall that the first pass through the code compiles it. Subsequent passes do not:

CPU times: user 72.3 ms, sys: 8 µs, total: 72.3 ms
Wall time: 72 ms

Notice that the second run takes about half the time of the first, so roughly half of the first run (about 75 ms) was spent compiling the code and the other half actually running it. The second time through, the code wasn’t recompiled, so it took only a little more than 70 ms to run.
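
A common idiom, then, is to “warm up” a jitted function once before timing it, so that compilation stays out of the measurement; a sketch, continuing in the same Jupyter session as Listing 2:

sum(x)          # first call: triggers compilation for this argument type
%timeit sum(x)  # subsequent calls time only the compiled code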