High-Performance Python – GPUs

Python finally has interoperable tools for programming the GPU – with or without CUDA.

When GPUs first became available for general-purpose computing, C code written with CUDA – a parallel computing platform and programming model developed by Nvidia for its GPUs – was the logical choice. Since then, Python has become the tool of choice for machine learning, deep learning, and, to some degree, scientific code in general.

Not long after the release of CUDA, the Python world created tools for use with GPUs. As with any new technology, a plethora of tools emerged to integrate Python with GPUs. For some time, the tools and libraries were adequate, but they soon started to show their age. The biggest problem was incompatibility.

If you used a tool to write code for the GPU, no other tools could read or use the data on the GPU. After making computations on the GPU with one tool, the data had to be copied back to the CPU. Then a second tool had to copy the data from the CPU to the GPU before commencing its computations. This data movement between the CPU and the GPU hurt overall performance. Even so, these tools and libraries allowed people to write GPU functions that worked with Python.

In this article, I discuss the Python GPU tools that are being actively developed and, more importantly, are likely to interoperate. With some of these tools, you don’t need to know CUDA to write GPU code; with others, you need some CUDA to write custom Python kernels.

Numba

One of the first tools people think of in regard to Python and GPUs is Numba. It is a just-in-time (JIT) compiler that translates a subset of Python and NumPy code into machine code. The back end of Numba is the LLVM compiler, which has a number of “translators” that convert the code to LLVM’s intermediate code, including for Nvidia and AMD GPUs and CPUs. The really great part is that you don’t have to compile any external code or have a C/C++ compiler installed, because LLVM comes with Numba. You just tell Numba the functions you want compiled, and it compiles them the first time they are used.

Numba uses the Python concept of “decorators” to indicate the functions to be compiled. In my first article on high-performance Python, I defined a decorator as a function that “… takes another function and extends it without explicitly modifying it. In essence, it is a wrapper to existing functions.” Using decorators allows Numba to extend functions to be compiled by the JIT compiler.

If you will recall, Numba has support for automatic parallelization of loops, generation of GPU-accelerated code, and the creation of universal functions (ufuncs) and C callbacks. Numba is very good for code that has high arithmetic intensity (lots of computations). This code typically comprises loops or array functions, but not necessarily. Numba will compile and run almost any Python code, but if you compile Python code that doesn't have a good amount of arithmetic intensity, you run the risk of the compiled code being even slower than the original.
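As a quick illustration of this workflow (a minimal sketch of my own; the function below is not from the Numba documentation), you decorate an arithmetic-heavy loop and let Numba compile it the first time it is called:

from numba import jit
import numpy as np
 
# Numba compiles this loop-heavy, arithmetic-intensive function to
# machine code on the first call.
@jit(nopython=True)
def sum_of_squares(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * a[i]
    return total
 
x = np.arange(1000000, dtype=np.float64)
print(sum_of_squares(x))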

Numba has two GPU JIT decorators that apply to this article: cuda.jit (Nvidia) and roc.jit (AMD). However, you can use the vectorize decorator, as well, with a cuda target. They all take Python code and compile for their target GPU. For this article, I use the cuda.jit and the vectorize decorators.

The astute reader will ask, “What’s the difference between cuda.jit and vectorize with target='cuda'?” The difference is fairly simple, although in the end, they both compile into PTX (parallel thread execution) assembly. The @vectorize decorator is in the general compiler path, so it can be steered onto a CUDA device. However, @cuda.jit is really the low-level Python CUDA kernel dialect that gives you access to CUDA variables such as threadIdx and to helper functions, which are closer to CUDA and give you more flexibility. You don’t have to use these CUDA variables, though.

Numba is focused on numerical data types such as float, int, complex, and so on; it is not really designed for string data types. It supports just about all math functions and some NumPy functions. Numba is focused, in essence, on data parallelism.

Numba Coding for GPUs

You can program for GPUs two ways with Numba. The first way uses ufuncs or gufuncs (GPU universal functions). In the second way, you define CUDA Python kernels. Don't worry, by sticking to ufuncs, you don’t have to do any CUDA coding if you’d rather not.

Ufuncs

Starting with ufuncs/gufuncs, you can create code to run on the GPU. The process is very similar to the one in my first article, in which I discussed creating ufuncs for CPUs. For GPU ufuncs, I use the @vectorize decorator, as in this very simple example of adding two numbers together:

from numba import vectorize
 
@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
    return x + y

The decorator line defines the data types (here, int64) and the target (cuda). A simple test of the add_ufunc Numba function is:

import numpy as np
 
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
 
print('a+b:\n', add_ufunc(a, b))

The answer should be:

a+b:
 [11 22 33 44]

In the previous example, everything that was to run on the GPU had to be put into a single Numba function. However, you can write GPU functions with Numba that can be called from other Numba functions; you just have to put a decorator on each function, because recursive optimization is not possible.
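A minimal sketch of this pattern (the function names are mine, not from the Numba documentation) uses a device function, marked with cuda.jit(device=True), that is called from a GPU ufunc:

from numba import vectorize, cuda
import math
 
# A "device" function: it runs on the GPU but can only be called
# from other GPU functions, not from the host.
@cuda.jit(device=True)
def hypot_gpu(x, y):
    return math.sqrt(x * x + y * y)
 
# A GPU ufunc that calls the device function above.
@vectorize(['float32(float32, float32)'], target='cuda')
def vector_length(x, y):
    return hypot_gpu(x, y)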

Remember that Numba is focused on improving the performance of arithmetic functions. Typically, that is loops and array operations. As a consequence, Numba does not support all of Python. It really only supports the following Python language constructs:

  • if/elif/else
  • while and for loops
  • basic math operations
  • selected functions from the math and cmath modules
  • tuples

Numba is constantly expanding the functions and constructs it can compile, so be sure to check with new versions of Numba as they come out.

CUDA Python Kernels

The advantage of ufuncs (gufuncs) is that you don’t have to know a thing about CUDA coding (CUDA-less programming!); however, if you want to get into a little bit of CUDA coding, then you can use Numba's second approach to coding for GPUs: CUDA Python kernels. Numba has included Python versions of CUDA functions and variables, such as block dimensions, grid sizes, and the like.

The next example is a CUDA kernel in Python from a Numba notebook for the Nvidia GTC 2017 (Listing 1) that is a version of the addition function shown in the previous section. In this version, you control the looping with CUDA variables provided by Numba when you use the @cuda.jit decorator.

Listing 1: CUDA Kernel

from numba import cuda
 
@cuda.jit
def add_kernel(x, y, out):
    tx = cuda.threadIdx.x # this is the unique thread ID within a 1D block
    ty = cuda.blockIdx.x  # Similarly, this is the unique block ID within the 1D grid
 
    block_size = cuda.blockDim.x  # number of threads per block
    grid_size = cuda.gridDim.x    # number of blocks in the grid
    
    start = tx + ty * block_size
    stride = block_size * grid_size
 
    # assuming x and y inputs are same length
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]

You can run the following code to use the function in Listing 1:

import numpy as np
 
n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x
out = np.empty_like(x)
 
threads_per_block = 128
blocks_per_grid = 30
 
add_kernel[blocks_per_grid, threads_per_block](x, y, out)
print(out[:10])

This simple code allows you to control the number of blocks per grid and the number of threads per block, which are passed to the function and used for the thread computations.

Numba provides other “helper” functions for Python CUDA kernels that allow you to write quite a variety of functions without needing to know a lot about CUDA – unless you want to.
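For example, the cuda.grid() and cuda.gridsize() helpers compute the global thread index and the total number of threads for you, so the kernel in Listing 1 could be written more compactly (a sketch under the same assumptions as Listing 1):

from numba import cuda
 
@cuda.jit
def add_kernel2(x, y, out):
    start = cuda.grid(1)       # absolute thread index in the 1D grid
    stride = cuda.gridsize(1)  # total number of threads in the grid
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]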

CuPy

Another excellent tool for writing Python code to run on GPUs is CuPy, an open source library of matrix functions for Nvidia CUDA that uses a number of CUDA libraries (e.g., cuBLAScuDNNcuRANDcuSOLVERcuFFT, and NCCL) to get as much performance as possible. By design, the CuPy interface is as close as possible to NumPy, making code porting much easier.

CuPy supports a large number of data types, including bool_, 8- to 64-bit int and uint, 16- to 64-bit float, and 64- and 128-bit complex types. CuPy has many array creation routines, such as empty and diag, and array manipulation routines such as reshape and concatenate. Linear algebra functions are also available, including products such as dot and matmul, along with decomposition routines such as cholesky and svd. Many of these routines use cuBLAS as the backing library.
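A few of these routines in action, as a brief sketch (the arrays are just illustrative):

>>> import cupy as cp
>>> x = cp.arange(12, dtype=cp.float32).reshape(3, 4)   # array creation and reshape
>>> y = cp.ones((4, 3), dtype=cp.float32)
>>> z = cp.matmul(x, y)                                  # product, backed by cuBLAS
>>> L = cp.linalg.cholesky(cp.eye(3))                    # decomposition routine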

Fundamentally, CuPy uses a class called cupy.ndarray that is very similar to the numpy.ndarray NumPy class. You can create an instance of the class very easily:

>>> import cupy as cp
>>> x_gpu = cp.array([1, 2, 3])

The variable x_gpu is an instantiation of the cupy.ndarray class.

One of the important aspects of computing on GPUs is moving data between the CPU and the GPU, and CuPy has special functions for this task. The first, cupy.asarray(), moves a numpy.ndarray, a list, or any object that can be passed to numpy.array() from the CPU to the current device (the GPU). For example:

>>> x_cpu = np.array([1, 2, 3])
>>> x_gpu = cp.asarray(x_cpu)

The cupy.asarray() function can also accept cupy.ndarray objects, which allows you to transfer an array between devices (e.g., from one GPU to another).

The GPU array can be transferred from the GPU to the CPU with cupy.asnumpy(),

>>> x_gpu = cp.array([1, 2, 3])
>>> x_cpu = cp.asnumpy(x_gpu)

or you can use cupy.ndarray.get(), if you like.
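For example, these two calls are equivalent ways to pull the array back to the host:

>>> x_cpu = cp.asnumpy(x_gpu)
>>> x_cpu = x_gpu.get()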

As mentioned previously, CuPy syntax is very close to NumPy syntax (a design goal), which allows porting; simply change the name of the module to start using the GPU:

>>> import numpy as np
>>> import cupy as cp
>>> x_cpu = np.array([1, 2, 3])
>>> l2_cpu = np.linalg.norm(x_cpu)
>>> x_gpu = cp.array([1, 2, 3])
>>> l2_gpu = cp.linalg.norm(x_gpu)

Notice how simple it is to change np to cp to enable GPUs. One recommendation is not to change the import line in the code to something like import cupy as np, which can cause problems if you need to use NumPy code and not CuPy code.

CuPy Example 1 – Matrix Multiplication

The first CuPy example is a simple matrix multiplication in single precision (float32). The arrays are created on the GPU with random values in the range of -1.0 to 1.0:

import cupy as cp
 
A = cp.random.uniform(low=-1., high=1., size=(64, 64)).astype(cp.float32)
B = cp.random.uniform(low=-1., high=1., size=(64, 64)).astype(cp.float32)
C = cp.matmul(A, B)

Notice in this snippet of code that the variable C remains on the GPU. You have to copy the variable back to the CPU explicitly if you need it there, because once the code ends, the data on the GPU is wiped (for all intents and purposes).
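If you do need the product on the host, an explicit copy (one line, using cupy.asnumpy() as shown earlier) does the job:

C_cpu = cp.asnumpy(C)   # explicit GPU-to-CPU copy of the result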

CuPy Example 2 – Singular Value Decomposition

The second CuPy example goes a little further than simple matrix multiplication to illustrate how to move data to and from the CPU and GPU. The code snippet simply computes a singular value decomposition (SVD) of a matrix. In this case, the matrix is a single-precision 64x64 matrix of random values. The code is very simple, with the data created on the GPU:

import cupy as cp
 
A = cp.random.uniform(low=-1., high=1., size=(64, 64)).astype(cp.float32)
u, s, v = cp.linalg.svd(A)

The results u, s, and v are still on the GPU.

Assume that you want to get u, s, and v back to the CPU from the GPU. The previous code has been modified by adding a few lines to copy the data back to the CPU and to print some object types (Listing 2), so you can see what happens when the data is moved between device (GPU) and host (CPU).

Listing 2: Copying from GPU to CPU

import numpy as np
import cupy as cp
 
A_cpu = np.random.uniform(low=-1., high=1., size=(64, 64)).astype(np.float32)
A_gpu = cp.asarray(A_cpu)
u_gpu, s_gpu, v_gpu = cp.linalg.svd(A_gpu)
print("type(u_gpu) = ", type(u_gpu))
u_cpu = cp.asnumpy(u_gpu)
print("type(u_cpu) = ", type(u_cpu))
 
[laytonjb@home4 CUPY]$ python svd2.py
type(u_gpu) =  <class 'cupy.core.core.ndarray'>
type(u_cpu) =  <class 'numpy.ndarray'>

Notice that the object type for u_gpu is based on CuPy and the object type for u_cpu is based on NumPy (it’s on the CPU). I have a tendency to add _gpu to variables that live on the GPU and _cpu to variables that live on the CPU, because if I were to mix them up, I would get an error or spend lots and lots of time debugging.

Some important differences between CuPy and NumPy should be noted:

  • Some casting behaviors from floating point to integer are not defined in the C++ specification. The casting from a negative floating point to an unsigned integer and from infinity to an integer are examples.
  • CuPy random methods support the dtype argument.
  • Out-of-bounds indices and duplicate values in indices are handled differently.
  • Reduction methods return zero-dimension arrays.

Keep these differences in mind as you port your NumPy code to CuPy.
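Two of these differences are easy to see interactively; a brief sketch (exact output formatting may vary with the CuPy version):

>>> r = cp.random.rand(3, dtype=cp.float32)   # CuPy's random methods accept a dtype argument
>>> np.arange(5).sum()                        # a NumPy reduction returns a scalar
10
>>> cp.arange(5).sum()                        # the CuPy equivalent returns a zero-dimension array
array(10)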

CuPy is undergoing rapid development, and new versions have new features, sometimes fixing the limitations of older versions, so keep watch on the CuPy mailing list and other outlets of Python news.

RAPIDS

Python GPU tools were scattered for many years. Each tool stood on its own, and virtually none were interoperable. As mentioned before, if you used one tool to compute on GPUs, you could not leave the data on the GPU for other tools to access; you had to copy the data back to the CPU, then use the other tool to move the data back to the GPU, operate on it, and then copy it back to the CPU. This clearly was not a good situation.

With the rise of machine learning (ML) and deep learning (DL), it became clear that something had to be done to solve this problem. A standard was needed so that tools could interoperate without having to move the data back to the CPU. In 2017, Nvidia, H2O.ai, Continuum Analytics (now called Anaconda), MapD, BlazingDB, Graphistry, and the Gunrock project from the University of California, Davis, created the GPU Open Analytics Initiative (GOAI). The GOAI mission is to create an open framework for developers and scientists to build applications on GPUs. As part of these goals, standard data formats and APIs would be used so that GPU-enabled tools could interoperate as easily as possible.

H2O and Nvidia created the RAPIDS suite of software libraries for GPU acceleration, focusing on data science workflows. RAPIDS is open source, so anyone can contribute to it or modify it to their needs. The overall project was designed for “… end-to-end data science and analytics pipelines entirely on GPUs.” The first release focused on two components: cuDF and cuML.

The central achievement of GOAI is cuDF, the component that allows Python GPU tools and libraries to interoperate. It is the DataFrame component of RAPIDS, built on Apache Arrow with a columnar memory format. A key design goal is that it be as close to the Pandas API as possible, so porting Python code to RAPIDS and other tools is very simple. If a GPU library or tool uses cuDF, it can access data on the GPU created by other tools (no moving data back and forth to the CPU). cuDF is the key to allowing libraries to interoperate.

The second piece of RAPIDS, cuML, is a set of libraries that provides GPU versions of the ML functions found in Scikit-learn. Algorithms have been added to cuML over time, as well as multi-GPU and multinode, multi-GPU versions of existing algorithms. You will also find many NumPy functions available in RAPIDS.

RAPIDS is the key to running Python ML code and general scientific code on GPUs for three reasons: (1) it is a standard that lets GPU libraries share data, (2) the ML algorithms are built from functions that are useful for computations other than ML, and (3) it is open source.

Furthermore, RAPIDS is very Pythonic, so if you are used to coding in Python or Pandas and Scikit-learn, you will have no difficulty adapting to RAPIDS. Virtually any data type is accommodated.

RAPIDS is also an important base for other tools, both inside and outside the project. Tools such as cuGraph, a graph analytics library; cuStrings, a string-handling library; and cuSpatial, a CUDA-accelerated GIS and spatiotemporal algorithm library, are examples of activity in the RAPIDS arena.

RAPIDS Example

The RAPIDS website has instructions on installing RAPIDS. It can be as simple as

conda install rapids

Note that RAPIDS requires a Linux distro (sorry, no Windows or Mac as of this writing). Nvidia also provides a container with pre-built RAPIDS free of charge. The RAPIDS website also has pointers to examples, including Jupyter notebooks. As a quick example, take the simple GPU notebook for linear regression. The code in Listing 3 is fairly simple because the CPU and timing code has been excluded.

Listing 3: Jupyter GPU Linear Regression

import os
 
import cudf as gd
import pandas as pd
 
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
 
from cuml.linear_model import LinearRegression as cuLR
 
 
n_samples = 2**20
n_features = 399
 
# Generate Data
X,y = make_regression(n_samples=n_samples, 
  n_features=n_features,
  random_state=0)
X = pd.DataFrame(X)
y = pd.Series(y)
 
X_train, X_test, y_train, y_test = train_test_split(X, y, 
  test_size = 0.2, 
  random_state=0)
 
X_cudf = gd.DataFrame.from_pandas(X_train)
X_cudf_test = gd.DataFrame.from_pandas(X_test)
 
y_cudf = gd.Series(y_train.values)
 
# Fit Model
ols_cuml = cuLR(fit_intercept=True,
                normalize=True,
                algorithm='eig')
ols_cuml.fit(X_cudf, y_cudf)
 
# Evaluate the Model
predict_cuml = ols_cuml.predict(X_cudf_test).to_array()
error_cuml = mean_squared_error(y_test, predict_cuml)
 
print("CUML MSE(y): %s" % error_cuml)

The data is generated by the Scikit-learn function make_regression and then split into a training set, used to fit the linear regression, and a test set, used to evaluate it. Finally, the mean squared error between the model’s predictions and the test data is computed.

Python GPU Custom Kernels

Sometimes you need a specific GPU function or routine that is not provided by an existing library or tool. In that case, you need to write a “custom kernel”; that is, a custom GPU kernel (routine). Fortunately, tools exist that allow you to do this. In some cases, it is fairly easy to write a custom kernel that looks a bit like Python; in others, you have to know more CUDA to write one.

In the next sections, I present two tools for writing custom CUDA Python kernels. The first is CuPy, which, as you have seen, can be used to write code that runs on the GPU with built-in functions; it also lets you write custom kernels. The second is PyCUDA, which has been around for a fairly long time but requires more CUDA knowledge.

CuPy Custom Kernels

Custom CUDA kernels written with CuPy only require a small snippet of C++, and CuPy automatically wraps and compiles it to make a CUDA binary. Compiled binaries are then cached and reused in subsequent runs. CuPy has four types of custom kernels:

  1. cupy.ElementwiseKernel
  2. cupy.ReductionKernel
  3. cupy.RawKernel
  4. cupy.fuse

The first kernel class defines an element-wise kernel, with or without broadcasting; the second defines a reduction kernel, with or without broadcasting; and the third allows you to write the kernel directly in CUDA code. The last is a decorator that “fuses” a function’s array operations together into a single custom kernel, which can ease development.

The CuPy kernels come with low-level CUDA support that can substitute for CUDA functions. The general categories of these functions are:

  • device and memory management
  • memory hooks
  • streams and events
  • profilers

These functions are extremely useful for writing custom kernels. Please read the CuPy Python custom kernel documentation for details on these functions.
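A brief sketch of a few of these functions in use (device selection, a stream, and timing events, all from the cupy.cuda namespace):

import cupy as cp
 
with cp.cuda.Device(0):                # device management: run on GPU 0
    x = cp.arange(10, dtype=cp.float32)
 
stream = cp.cuda.Stream()              # create a CUDA stream
with stream:                           # work in this block is queued on the stream
    y = x * 2
stream.synchronize()                   # wait for the stream to finish
 
start = cp.cuda.Event()                # events for timing
stop = cp.cuda.Event()
start.record()
z = cp.dot(x, y)
stop.record()
stop.synchronize()
print(cp.cuda.get_elapsed_time(start, stop), "ms")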

cupy.ElementwiseKernel

The element-wise kernel focuses on kernels that operate on an element-wise basis. You don’t write a full CUDA kernel; you supply only the loop body (a small snippet of C-like code), and CuPy generates the rest, which greatly simplifies writing custom kernels. An element-wise kernel comprises four components:

  • input argument list (comma-separated)
  • output argument list (comma-separated)
  • loop body code
  • kernel name

Each argument comprises a type specifier and an argument name:

>>> kernel = cp.ElementwiseKernel(
...     'float32 x, float32 y', 'float32 z',
...     '''if (x - 2 > y) {
...       z = x * y;
...     } else {
...       z = x + y;
...     }''', 'my_kernel')

In the first line, the object instantiation is named kernel. The next line has the variables to be used as input (x and y) and output (z). These variables can be typed with NumPy data types, as shown. The function code then follows. The last line states the kernel name, which is my_kernel, in this case.

Notice that this code uses an if/else construct, which is allowed in custom kernels. The loop body itself is a small C-like snippet that is fairly easy to write; the CuPy manual contains details on how to write it.

Many times it would be nice to create a generic kernel that can handle multiple data types. CuPy custom kernels allow this with the use of a type placeholder:

kernel = cp.ElementwiseKernel(
  'T x, T y', 'T z',
  '''if (x - 2 > y) {
     z = x * y;
  } else {
     z = x + y;
  }''', 'my_kernel')

If you use type placeholders with the same character in the kernel definition (here, T), they all indicate the same type, which is determined from the actual argument types. The element-wise kernel first checks the output arguments and then the input arguments to determine the types; if you don’t specify the output argument(s), the input arguments are used to determine the data type. Additionally, you can use the placeholder types in the function code itself.

One of the great things about the ElementwiseKernel class, as well as the other custom kernel classes, is that it handles all of the indexing and broadcasting. If you need or want to use manual indexing, you can do that as well with the raw keyword, along with the special variable i, which indicates the index within the loop, and the method _ind.size(), which gives the total number of elements the kernel operates on.
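For example, a kernel modeled on the add_reverse example in the CuPy documentation uses a raw argument and manual indexing to add x to y read in reverse order:

>>> add_reverse = cp.ElementwiseKernel(
...     'T x, raw T y', 'T z',
...     'z = x + y[_ind.size() - i - 1]',
...     'add_reverse')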

cupy.ReductionKernel

The second type of CuPy custom kernel is the reduction kernel, which is focused on kernels of the Map-Reduce type. The class behind it is the ReductionKernel class; it, too, has four parts:

  • Identity value: Initial value of the reduction.
  • Mapping expression: Preprocesses each element to be reduced.
  • Reduction expression: An operator that reduces the mapped values. Two special variables, a and b, are used as its operands.
  • Post-mapping expression: Transforms the reduced values. The special variable a is used as input. The output should be written to the output variable.

This function uses type placeholders for both the input and output.
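The classic illustration (modeled on the L2 norm example in the CuPy documentation) shows all four parts, along with the input and output argument lists:

import cupy as cp
 
l2norm_kernel = cp.ReductionKernel(
    'T x',          # input argument list
    'T y',          # output argument list
    'x * x',        # mapping expression
    'a + b',        # reduction expression
    'y = sqrt(a)',  # post-mapping expression
    '0',            # identity value
    'l2norm'        # kernel name
)
x = cp.arange(10, dtype=cp.float32).reshape(2, 5)
print(l2norm_kernel(x, axis=1))   # L2 norm of each row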

cupy.RawKernel

The third CuPy custom kernel type allows you to create a kernel with CUDA code and the RawKernel class. As with CUDA, it gives you control of the grid size, block size, shared memory size, and stream, because the kernel is launched through CUDA's cuLaunchKernel interface. Because the kernel is CUDA code, you might want to inspect its attributes, either through the attributes dictionary (a Python data type) or directly as attributes of the object; certain attributes can also be set by changing the object's attributes directly.
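A short sketch modeled on the element-wise addition example in the CuPy documentation:

import cupy as cp
 
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float* y) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    y[tid] = x1[tid] + x2[tid];
}
''', 'my_add')
 
x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.float32)
add_kernel((5,), (5,), (x1, x2, y))   # grid size, block size, kernel arguments
print(y)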

cupy.fuse Example

The cupy.fuse decorator can “fuse” custom kernel functions together. In some ways, it looks like a Numba decorator, and it can be easier to use than an ElementwiseKernel or a ReductionKernel. You can call the fused function with scalars, NumPy arrays, and, of course, CuPy arrays.
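A minimal sketch (the squared_difference function is just illustrative):

import cupy as cp
 
@cp.fuse()
def squared_difference(x, y):
    # the subtraction and multiplication are fused into one GPU kernel
    return (x - y) * (x - y)
 
x_cp = cp.arange(10, dtype=cp.float32)
y_cp = cp.arange(10, dtype=cp.float32)[::-1]
print(squared_difference(x_cp, y_cp))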

PyCUDA

One of the first tools for running GPU code came from within Python: PyCUDA, which allows you to access the CUDA API from Python. This tool forces you to learn CUDA, in addition to knowing Python. You can see some code samples in the PyCUDA documentation. Note that you can leave the data on the GPU and operate on it with another kernel.

Simple PyCUDA Example

A simple PyCUDA example from a few years ago is shown in Listing 4. This code creates a 4x4 array of random values (float32), copies it to the GPU, and doubles the values. A CUDA kernel is defined in the source module; then, the compiled function is retrieved with the get_function() method of the mod object. The function is called, and the result is copied back to the host with the memcpy_dtoh function.

Listing 4: PyCUDA Example

import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
 
# create a 4x4 array of random single-precision values on the host
# and copy it to the GPU
a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
 
# CUDA kernel that doubles each element of the array
mod = SourceModule("""
   __global__ void twice(float *a)
   {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
   }
   """)
 
func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))
 
# copy the result back to the host
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)

Summary

Initially, to code GPUs, you had to learn CUDA, a C variant with many extra functions that let you take full advantage of GPU performance. At the time, using C as the base language for programming GPUs was very logical, because many people, including scientists, wrote code in C. However, developers didn’t want to learn C and CUDA just to create functions for Python.

Some developers knew C and quickly learned CUDA, so they could write tools and libraries for Python. For several years, it was the wild west of Python GPU tools, each of which stood on its own according to the needs of its developer, and none of which worked with the others. Over time, thanks to projects like GOAI and RAPIDS, the tools have slowly come together. Each component stands on its own yet plays well with the others, forming an ecosystem: CuPy can interact and share data with RAPIDS, and Numba can interoperate with both tools.

Very soon, people will say that Python and GPUs go together like peas and carrots. Today, the tools interoperate, allowing code to be easily written in Python or for Python using CUDA. Given the amount of time it has taken to go from zero interoperability to the present state of the community, don't blink.

Tags: CUDA, GPU, HPC, Python