Lead Image © Konrad Bak, 123RF.com

High-performance Python – compiled code and C interface

Step Lively

Article from ADMIN 52/2019
Although Python is a popular language, in the high-performance world, it is not known for being fast. A number of tactics have been employed to make Python faster. We look at three: Numba, Cython, and ctypes.

Python is one of the fastest-growing languages for computing, the number one language for deep learning, and in the top three for machine learning. Literally thousands of Python add-on modules are available for everything from plotting data to communicating with embedded hardware.

One of the common complaints about Python is that it is too slow, partly because it is interpreted and partly because of the Global Interpreter Lock (GIL), a mutex that prevents multiple threads from executing Python bytecodes at once. People started coming up with tools to improve the performance of Python, usually taking the form of compiling Python or interfacing compiled languages with Python.

In this article, I investigate three tools: a just-in-time (JIT) compiler for Python code, a tool that translates Python code into C code that is compiled into a module usable from Python, and a tool for calling existing compiled C code from Python. The goal of all three tools is to make Python code faster. I use the Anaconda Python distribution, which has some of the more current tools, but it doesn't have everything, so tools that aren't available in Anaconda won't be presented.

Compiling Python with Numba

Numba [1] is an open source JIT compiler that translates a subset of Python and NumPy [2] code into fast machine code at run time; hence, the "JIT" designation. Numba relies on the LLVM [3] compiler library to generate the machine code. You can also write CUDA kernels [4] with Numba. Numba supports automatic parallelization of loops, generation of GPU-accelerated code (both Nvidia and AMD), and the creation of universal functions (ufuncs) [5] and C callbacks. The compiler is under continual development, with the addition of more capability, more performance, and more NumPy functions.

A ufunc operates on an ndarray [6], element by element. You can think of ufuncs as being a "vectorized" wrapper for a function that takes a fixed number of inputs and produces a fixed number of outputs. Ufuncs are very important to Numba.

Numba takes the bytecode of the Python function, analyzes and optimizes it, and converts it into Numba's intermediate representation (IR). Type inference follows, and the typed code is converted into LLVM IR. The resulting code is then fed to LLVM's JIT compiler to output machine code.

You see the most benefit with Numba on functions that have a great deal of arithmetic intensity (lots of computations). An example would be routines that have loops. Although you can compile Python functions that don't have loops, they might run slower than the original Python code.


Decorators [7] are a really cool part of Python that let you apply higher-order functions with a compact syntax. A decorator is a function that takes another function and extends its behavior without explicitly modifying it. In essence, it is a wrapper around an existing function. An in-depth explanation is beyond the scope of this article, but you can read more about decorators online [8].
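As a minimal sketch of the idea in plain Python (no Numba involved), the following hypothetical timed decorator wraps a function and reports how long each call takes. The names timed and total_up_to are made up for illustration.

```python
import functools
import time


def timed(func):
    """Decorator: report how long each call to func takes."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f} s")
        return result
    return wrapper


@timed  # same as: total_up_to = timed(total_up_to)
def total_up_to(n):
    total = 0
    for i in range(n):
        total += i
    return total


answer = total_up_to(1_000_000)
```

The function body is untouched; the decorator only adds behavior around the call, which is the same mechanism Numba uses to hand a function to its compiler.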

Numba uses decorators to mark the functions that the JIT compiler should compile. A number of decorators are available; how you refer to them depends on how you import numba:

  • @jit (compile Python code)
  • @njit (shorthand for @jit with nopython=True)
  • @generated_jit (control flexible specializations)
  • @jitclass (compile Python classes)
  • @cfunc (create C callbacks)
  • @stencil (specify a stencil kernel)
  • @vectorize (compile a function of scalar arguments into a NumPy ufunc)

Even Nvidia GPUs have a decorator,

  • @cuda.jit

as do AMD ROCm GPUs:

  • @roc.jit

Before I show examples, remember that decorated code has to be something Numba can compile and should have relatively high arithmetic intensity (e.g., loops). Numba can compile lots of Python code, but if the result runs slower than the native Python code, why use it?

When you first call a function decorated with @jit, the code is compiled; therefore, if you time functions, the first pass through the code will include the compilation time. Fortunately, Numba keeps the resulting machine code for subsequent calls, so calling the function a second time does not include the compilation time (unless you've changed the code). This assumes that you call the function with the same argument types; new argument types trigger a new compilation.
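A minimal sketch of this behavior follows. The function scaled_sum and the helper timed_call are hypothetical names chosen for illustration, and the try/except fallback is my own addition so that the script still runs (without any speedup) on a system where Numba is not installed.

```python
import time

try:
    from numba import jit  # the real Numba decorator
except ImportError:
    # Assumption for portability: a no-op stand-in so the sketch still
    # runs (as plain Python) when Numba is not installed.
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@jit(nopython=True)
def scaled_sum(x, n):
    total = 0.0
    for i in range(n):
        total += x * i
    return total


def timed_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# First call with (int, int): Numba compiles, then runs.
first_result, first_time = timed_call(scaled_sum, 2, 1_000_000)
# Second call with the same types: the compiled code is reused.
second_result, second_time = timed_call(scaled_sum, 2, 1_000_000)
# A float argument is a new type signature, so Numba compiles again.
float_result, float_time = timed_call(scaled_sum, 2.0, 1_000_000)

print(f"first call (int):    {first_time:.6f} s")
print(f"second call (int):   {second_time:.6f} s")
print(f"first call (float):  {float_time:.6f} s")
# With Numba present, this lists one compiled signature per argument-type set.
print(getattr(scaled_sum, "signatures", "Numba not installed"))
```

With Numba installed, I would expect the first and third timings to be noticeably larger than the second, because each includes a compilation.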

You can pass arguments to the @jit decorator. Numba has two compilation modes: nopython and object. In general, nopython produces much faster code, but when Numba encounters code it cannot compile in that mode, it can fall back to the slower object mode. To prevent this fallback and raise an error instead, you should pass the option nopython=True to the decorator.

Another argument that can be passed is nogil. The GIL prevents more than one thread from executing Python code at a time. If you are sure that your code is thread-safe (no race conditions and no need for synchronization), you can pass nogil=True, and Numba will release the GIL when entering the compiled function. Note that you can't use nogil when you are in object mode.
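The following sketch shows one way nogil=True might be combined with ordinary Python threads. The function count_multiples and the chunk boundaries are hypothetical, and the no-op fallback decorator is my own addition so that the script also runs (serially, in effect) where Numba is missing.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from numba import jit  # the real Numba decorator
except ImportError:
    # Assumption for portability: no-op stand-in when Numba is missing.
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@jit(nopython=True, nogil=True)
def count_multiples(start, stop, k):
    """Count the integers in [start, stop) that are divisible by k."""
    count = 0
    for i in range(start, stop):
        if i % k == 0:
            count += 1
    return count


# Warm-up call so compilation happens once, before the threads start.
count_multiples(0, 1, 7)

# Each thread works on its own range and shares no mutable state,
# so releasing the GIL is safe here.
chunks = [(0, 250_000), (250_000, 500_000),
          (500_000, 750_000), (750_000, 1_000_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(count_multiples, lo, hi, 7) for lo, hi in chunks]
    total = sum(f.result() for f in futures)

print(total)
```

Because every chunk is independent, the threads never touch the same data, which is the kind of code for which nogil is intended.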

A third argument that can be used with the @jit decorator is parallel. If you pass parallel=True, Numba runs a transformation pass that attempts to parallelize portions of the code automatically and applies other optimizations. In particular, Numba supports explicit parallel loops. However, parallel=True has to be used together with nopython=True. Currently, I believe the parallel option only works with CPUs.

A really cool feature of this transformation pass is that, with parallel=True, you can use Numba's prange function in place of range to tell Numba that a loop can be parallelized. Just be sure that the loop has no cross-iteration dependencies other than supported reductions (e.g., accumulating into a scalar with +=); Numba does not check this for you, and a loop with other dependencies can produce wrong results.
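A minimal sketch of a prange loop with a supported reduction might look like the following. The function basel_sum is a hypothetical example, and the fallback that maps prange to range is my own addition so that the script still runs, serially, without Numba.

```python
try:
    from numba import jit, prange  # the real Numba names
except ImportError:
    # Assumption for portability: run as a plain serial loop
    # when Numba is not installed.
    prange = range

    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@jit(nopython=True, parallel=True)
def basel_sum(n):
    """Approximate pi**2 / 6 by summing 1/i**2 for i = 1..n."""
    total = 0.0
    for i in prange(1, n + 1):
        # The only cross-iteration dependency is this += on a scalar,
        # which is a reduction that Numba can parallelize.
        total += 1.0 / (i * i)
    return total


approx = basel_sum(1_000_000)
print(approx)
```

Each iteration depends only on its own index, so the order in which the iterations run does not matter apart from floating-point rounding.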

As mentioned previously, when you execute decorated Python code that Numba understands, it is compiled to machine code with LLVM. This compilation happens the first time you execute the code. The example code in Listing 1 is borrowed from a presentation by Matthew Rocklin [9]. It was run in a Jupyter notebook to get the timings.

Listing 1

Python First Run Without Numba

import numpy

def sum(x):
   total = 0
   for i in range(x.shape[0]):
      total += x[i]
   return total

x = numpy.arange(10_000_000)
%time sum(x)
CPU times: user 1.63 s, sys: 0 ns, total: 1.63 s
Wall time: 1.63 s

Next, import the numba module and apply the @jit decorator to the function (Listing 2).

Listing 2

Python First Run with Numba

import numba
import numpy

@numba.jit
def sum(x):
   total = 0
   for i in range(x.shape[0]):
      total += x[i]
   return total

x = numpy.arange(10_000_000)
%time sum(x)
CPU times: user 145 ms, sys: 4.02 ms, total: 149 ms
Wall time: 149 ms

The roughly 11-fold speedup is nice to see, but believe it or not, quite a bit of that time is spent compiling the function. Recall that the first pass through the code compiles it. Subsequent passes do not, as a second run of the same call shows:

CPU times: user 72.3 ms, sys: 8 µs, total: 72.3 ms
Wall time: 72 ms

Notice that the second run takes about half as long as the first: Roughly 77 ms of the first run went to compiling the code and about 72 ms to running it. The second time, the code was already compiled, so the call took only about 72 ms, more than 20 times faster than the pure Python version in Listing 1.

Cython

Cython [10] is an optimizing static compiler for Python (Python 2/Python 3) and an extended programming language based on Pyrex [11]. (Note: You had better be moving away from Python 2 and on to Python 3 pretty quickly, because, as the Python wiki [12] says, "Python 2.x is legacy, Python 3.x is the present and future of the language.") The Pyrex language is used to create Python modules. It is really a superset of Python that also supports calling C functions and declaring C types on variables and class attributes. As a result, with some work, very efficient C code can be generated.

Unlike Numba, which compiles at run time, Cython translates the Python code to C ahead of time, and a C compiler then builds it into an extension module that Python can import. In general, the C code compiles with almost any C/C++ compiler, which makes Cython a good tool for compiling Python code that is frequently used but doesn't change too much.
