Python finally has interoperable tools for programming the GPU – with or without CUDA.

High-Performance Python – GPUs

When GPUs became available, C code via CUDA, a parallel computing platform and programming model developed by Nvidia for GPUs, was the logical language of choice. Since then, Python has become the tool of choice for machine learning, deep learning, and, to some degree, scientific code in general.

Not long after the release of CUDA, the Python world quickly created tools for use with GPUs. As with new technologies, a plethora of tools emerged to integrate Python with GPUs. For some time, the tools and libraries were adequate, but soon they started to show their age. The biggest problem was incompatibility.

If you used a tool to write code for the GPU, no other tools could read or use the data on the GPU. After making computations on the GPU with one tool, the data had to be copied back to the CPU. Then a second tool had to copy the data from the CPU to the GPU before commencing its computations. The data movement between the CPU and the GPU really affected overall performance. However, these tools and libraries allowed people to write functions that worked with Python.

In this article, I discuss the Python GPU tools that are being actively developed and, more importantly, likely to interoperate. Some tools don’t need to know CUDA for GPU code, and other tools do need to know CUDA for custom Python kernels.


One of the first tools people think of in regard to Python and GPUs is Numba. It is a just-in-time (JIT) compiler that translates a subset of Python and NumPy code into machine code. The back end of Numba is the LLVM compiler, which has a number of “translators” that convert the code to LLVM’s intermediate code, including for Nvidia and AMD GPUs and CPUs. The really great part is that you don’t have to compile any external code or have a C/C++ compiler installed, because LLVM comes with Numba. You just tell Numba the functions you want compiled, and it compiles them the first time they are used.

Numba uses the Python concept of “decorators” to indicate the functions to be compiled. In my first article on high-performance Python, I defined a decorator as a function that “… takes another function and extends it without explicitly modifying it. In essence, it is a wrapper to existing functions.” Using decorators allows Numba to extend functions to be compiled by the JIT compiler.

If you will recall, Numba has support for automatic parallelization of loops, generation of GPU-accelerated code, and the creation of universal functions (ufuncs) and C callbacks. Numba is very good for code that has high arithmetic intensity (lots of computations). This code typically comprises loops or array functions, but not necessarily. Numba will compile and run almost any Python code, but if you compile Python code that doesn't have a good amount of arthimetic intensity, you run the risk of the compiled code being even slower than the customary code.

Numba has two GPU JIT decorators that apply to this article: cuda.jit (Nvidia) and roc.jit (AMD). However, you can use the vectorize decorator, as well, with a cuda target. They all take Python code and compile for their target GPU. For this article, I use the cuda.jit and the vectorize decorators.

The astute reader will ask, “What’s the difference between cuda.jit and vectorize with target = gpu? The difference is fairly simple, although in the end, they both compile into PTX (parallel thread execution) Assembly. The @vectorize decorator is in the general compiler path, so it can be steered onto a CUDA device. However, @cuda.jit is really the low-level Python CUDA kernel dialect that gives you access to CUDA variables such as threadIdx and helper functions, which are closer to CUDA and give you more flexibility. You don’t have to use these CUDA variables, though.

Numba is focused on numerical data types such as floatintcomplex, and so on; it is not really designed for string data types. It supports just about all math functions and some NumPy functions. Numba is focused, in essence, on data parallelism.

Numba Coding for GPUs

You can program for GPUs two ways with Numba. The first way uses ufuncs or gufuncs (GPU universal functions). In the second way, you define CUDA Python kernels. Don't worry, by sticking to ufuncs, you don’t have to do any CUDA coding if you’d rather not.


Starting with ufuncs/gufuncs, you can create code to run on the GPU. It's very similar to the first article, in which I discussed using it for CPUs. For GPU ufuncs, I use the @vectorize decorator, as in this very simple example of adding two data types together:

from numba import vectorize
@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
    return x + y

The decorator line defines the data types (i.e., int64 here) and the target for the decorator cuda. A simple test for the add_ufunc Numba function is:

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print('a+b:\n', add_ufunc(a, b))

The answer should be:

 [11 22 33 44]

In the previous example, you had to put everything that was to run on the GPU into a single Numba function; however, you can write GPU functions with Numba that can be called from other Numba functions, but you have to put a decorator on each function because recursive optimization is not possible.

Remember that Numba is focused on improving the performance of arithmetic functions. Typically, that is loops and array operations. As a consequence, Numba does not support all of Python. It really only supports the following Python language constructs:

  • if/elif/else
  • while and for loops
  • basic math operations
  • selected functions from the math and cmath modules
  • tuples

Numba is constantly expanding the functions and constructs it can compile, so be sure to check with new versions of Numba as they come out.

CUDA Python Kernels

The advantage of ufuncs (gufuncs) is that you don’t have to know a thing about CUDA coding (CUDA-less programming!); however, if you want to get into a little bit of CUDA coding, then you can use Numba's second approach to coding for GPUs: CUDA Python kernels. Numba has included Python versions of CUDA functions and variables, such as block dimensions, grid sizes, and the like.

The next example is a CUDA kernel in Python from a Numba notebook for the Nvidia GTC 2017 (Listing 1) that is a version of the addition function shown in the previous section. In this version, you control the looping with CUDA variables provided by Numba when you use the @cuda.jit decorator.

Listing 1: CUDA Kernel

from numba import cuda
def add_kernel(x, y, out):
    tx = cuda.threadIdx.x # this is the unique thread ID within a 1D block
    ty = cuda.blockIdx.x  # Similarly, this is the unique block ID within the 1D grid
    block_size = cuda.blockDim.x  # number of threads per block
    grid_size = cuda.gridDim.x    # number of blocks in the grid
    start = tx + ty * block_size
    stride = block_size * grid_size
    # assuming x and y inputs are same length
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]

You can run the following code to use the function in Listing 1:

import numpy as np
n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x
out = np.empty_like(x)
threads_per_block = 128
blocks_per_grid = 30
add_kernel[blocks_per_grid, threads_per_block](x, y, out)

This simple code allows you to control the number of blocks per grid and the number of threads per block, which are passed to the function and used for the thread computations.

Numba provides other “helper” functions for Python CUDA kernels that allow you to write quite a variety of functions without needing to know a lot about CUDA – unless you want to.


Another excellent tool for writing Python code to run on GPUs is CuPy, an open source library of matrix functions for Nvidia CUDA that uses a number of CUDA libraries (e.g., cuBLAScuDNNcuRANDcuSOLVERcuFFT, and NCCL) to get as much performance as possible. By design, the CuPy interface is as close as possible to NumPy, making code porting much easier.

CuPy supports a large number of data types, including bool_, 8 to 64-bit int and uint, 16 to 64-bit float, and 64 and 128-bit complex types. CuPy has many array creation routines, such as empty and diag, and array manipulation routines reshape and concatenate. Linear algebra functions are also available, including products such as dot and matmul, along with decomposition routines such as cholesky and svd. Many of these routines use cuBLAS as the backing library.

Fundamentally, CuPy uses a class called cupy.ndarray that is very similar to the numpy.ndarray NumPy class. You can instantiate an instance of the class very easily:

>>> import cupy as cp
>>> x_gpu = cp.array([1, 2, 3])

The variable x_gpu is an instantiation of the cupy.ndarray class.

One of the important aspects of computing on GPUs is moving data between the CPU and the GPU. Some special functions in CuPy accomplish this. The first function, cupy.asarray(), can be used to move a numpy.ndarray, a list, or any object that can be passed to a numpy.array that is on the CPU to the current device (the GPU). For example, the cupy.asarray function

>>> x_cpu = np.array([1, 2, 3])
>>> x_gpu = cp.asarray(x_cpu)

can also accept cupy.ndarray objects, allowing the transfer of the array between devices (CPU to GPU).

The GPU array can be transferred from the GPU to the CPU with cupy.asnumpy(),

>>> x_gpu = cp.array([1, 2, 3])
>>> x_cpu = cp.asnumpy(x_gpu)

or you can use cupy.ndarray.get(), if you like.

As mentioned previously, CuPy syntax is very close to NumPy syntax (a design goal), which allows porting; simply change the name of the module to start using the GPU:

>> import numpy as np
>> import cupy as cp
>> x_cpu = np.array([1, 2, 3])
>> l2_cpu = np.linalg.norm(x_cpu)
>> x_gpu = cp.array([1 ,2 ,3])
>> l2_gpu = cp.linalg.norm(x_gpu)

Notice how simple it is to change np to cp to enable GPUs. One recommendation is not to change the import line in the code to something like import cupy as np, which can cause problems if you need to use NumPy code and not CuPy code.

CuPy Example 1 – Matrix Multiplication

The first CuPy example is a simple matrix multiplication in single precision (float32). The arrays are created on the GPU with random values in the range of -1.0 to 1.0:

import math
import cupy as cp
A = cp.random.uniform(low=-1., high=1., size(64,64)).astype(cp.float32)
B = cp.random.uniform(low=-1., high=1., size(64,64)).astype(cp.float32)
C = cp.matmul(A,B)

Notice in this snippet of code that the variable C remains on the GPU. You have to copy the variable back to the CPU explicitly if you need it there, because once the code ends, the data on the GPU is wiped (for all intents and purposes).

CuPy Example 2 – Singular Value Decomposition

The second CuPy example goes a little further than simple matrix multiplication to illustrate how to move data to and from the CPU and GPU. The code snippet simply computes a singular value decomposition (SVD) of a matrix. In this case, the matrix is a single-precision 64x64 matrix of random values. The code is very simple, with the data created on the GPU:

import cupy as cp
A = cp.random.uniform(low=-1., high=1., size=(64, 64)).astype(cp.float32)
u, s, v = cp.linalg.svd(A)

The results us, and v are still on the GPU.

Assume that you want to get us, and v back to the CPU from the GPU. The previous code has been modified by adding a few lines to copy the data back to the CPU and to print some object types (Listing 2), so you can see what happens when the data is moved between device (GPU) and host (CPU).

Listing 2: Copying from GPU to CPU

import cupy as cp
A_cpu = np.random.uniform(low=-1., high=1., size=(64, 64)).astype(np.float32)
A_gpu = cp.asarray(A_cpu)
u_gpu, s_gpu, v_gpu = cp.linalg.svd(A_gpu)
print "type(u_gpu) = ",type(u_gpu)
u_cpu = cp.asnumpy(u_gpu)
print "type(u_cpu) = ",type(u_cpu)
[laytonjb@home4 CUPY]$ python
type(u_gpu) =  
type(u_cpu) =  

Notice that the object type for u_pgu is based on CuPy and the object type for u_cpu isbased on NumPy (it’s on the CPU). I have a tendency to add _gpu to variables that live on the GPU and _cpu for variables that live on the CPU, because if I were to mix them up, I would get an error or spend lots and lots of time debugging.

Some important differences between CuPy and NumPy should be noted:

  • Some casting behaviors from floating point to integer are not defined in the C++ specification. The casting from a negative floating point to an unsigned integer and from infinity to an integer are examples.
  • CuPy random methods support the dtype argument.
  • Out-of-bounds indices and duplicate values in indices are handled differently.
  • Reduction methods return zero-dimension arrays.

Keep these differences in mind as you port your NumPy code to CuPy.

CuPy is undergoing rapid development, and new versions have new features, sometimes fixing the limitations of older versions, so keep watch on the CuPy mailing list and other outlets of Python news.