High-Performance Python – GPUs


Python GPU tools were scattered for many years. Each tool stood on its own, and virtually none were interoperable. As mentioned before, if you used one tool to compute on GPUs, you could not leave the data on the GPU for other tools to access; you had to copy the data back to the CPU, then use the other tool to move the data back to the GPU, operate on it, and then copy it back to the CPU. This clearly was not a good situation.

With the rise of machine learning (ML) and deep learning (DL), it became clear that something had to be done to solve this problem. A standard was needed so that tools could interoperate without having to move the data back to the CPU. In 2017, Nvidia, H2O.ai, Continuum Analytics (now called Anaconda), MapD, BlazingDB, Graphistry, and the Gunrock project from the University of California, Davis, created the GPU Open Analytics Initiative (GOAI). The GOAI mission is to create open framework for developers and scientists to build applications on GPUs. As part of the goals, standard data formats and APIs would be used so that GPU-enabled tools could interoperate as easily as possible.

H2O and Nvidia created the RAPIDS suite of software libraries for GPU acceleration, focusing on data science workflows. RAPIDS is open source, so anyone can contribute to it or modify it to their needs. The overall project was designed for “… end-to-end data science and analytics pipelines entirely on GPUs.” The first release focused on two components: cuDF and cuML.

The core of GOAI's achievements is cuDF, the core component of Python tools and libraries that allows them to interoperate. It is the core DataFrame component of RAPIDS, built on Apache Arrow with a columnar memory format. A key design goal is that it be as close to the Pandas API as possible so that porting Python code to RAPIDS and other tools is very simple. If the GPU library or tool uses cuDF, it can access the data on the GPU from other tools (no moving data back and forth to the CPU). cuDF is the key to allowing libraries to interoperate.

The second piece of RAPIDS, cuML, is a set of libraries that provide GPU versions of machine learning (ML) functions in Scikit-learn. Algorithms have been added to cuML over time, as well as multi-GPU versions and multinode, multi-GPU versions of existing algorithms. For example, you will find many NumPy functions available in RAPIDS.

RAPIDS is the key to running Python ML code and general scientific code on GPUs for three reasons: (1) It is a standard for GPU libraries for sharing data. (2) the ML algorithms are built from functions that are useful for computations other than ML, and (3) it is open source.

Furthermore, RAPIDS is very Pythonic, so if you are used to coding in Python or Pandas and Scikit-learn, you will have no difficulty adapting to RAPIDS. Virtually any data type is accommodated.

RAPIDS is an important part of a base for others tools in and out of RAPIDS. Tools such as cuGraph, a graph analytics library; cuStrings, a string-handling library; and cuSpatial, a CUDA-accelerated GIS and spatiotemporal algorithm library are examples of activity in the RAPIDS arena.

RAPIDS Example

The RAPIDS website has instructions on installing RAPIDS. It can be as simple as

conda install rapids

Note that RAPIDS requires a Linux distro (sorry, no Windows or Mac as of this writing). Nvidia also provides a container with a pre-built RAPIDS free of charge. The RAPIDS website also has pointers to examples, including Jupyter notebooks. As a quick example, take the simple GPU notebook for linear regression. The code is fairly simple because it excludes the CPU and timing code (Listing 3).

Listing 3: Jupyter GPU Linear Regression

import os
import cudf as gd
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from cuml.linear_model import LinearRegression as cuLR
n_samples = 2**20
n_features = 399
# Generate Data
X,y = make_regression(n_samples=n_samples, 
X = pd.DataFrame(X)
y = pd.Series(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
  test_size = 0.2, 
X_cudf = gd.DataFrame.from_pandas(X_train)
X_cudf_test = gd.DataFrame.from_pandas(X_test)
y_cudf = gd.Series(y_train.values)
# Fit Model
ols_cuml = cuLR(fit_intercept=True,
ols_cuml.fit(X_cudf, y_cudf)
# Evaluate the Model
predict_cuml = ols_cuml.predict(X_cudf_test).to_array()
error_cuml = mean_squared_error(y_test, predict_cuml)
print("CUML MSE(y): %s" % error_cuml)

The data is generated by the Scikit-learn function make_regression; then, it is split into a training test set used to create the linear regression and a test set used to test the linear regression. Finally, the error between the linear regression and test data is computed.

Python GPU Custom Kernels

Sometimes you need a specific GPU function or routine that is not provided by an existing library or tool. For this case, you need to write a “custom kernel”; that is, a custom GPU kernel (routine). Fortunately, tools exist that allow you to do this. In some cases, it is fairly easy to write a custom kernel that looks a bit like Python. On the other side, you might have to know more about CUDA to write them.

In the next section, I present two tools for writing CUDA Python custom kernels. The first is a CuPy kernel, which you have seen can be used to write code that runs on the GPU with built-in functions. The second is from the PyCUDA tool, which has been around for a fairly long time but requires more CUDA knowledge.

CuPy Custom Kernels

Custom CUDA kernels written with CuPy only require a small snippet of C++, and CuPy automatically wraps and compiles it to make a CUDA binary. Compiled binaries are then cached and reused in subsequent runs. CuPy has four types of custom kernels:

  1. cupy.ElementwiseKernel
  2. cupy.ReductionKernel
  3. cupy.RawKernel
  4. cupy.fusekernel

The first kernel class defines an element-wise kernel with or without broadcasting; the second class defines a reduction kernel, with or without broadcasting; and the third class allows you to write CUDA code for the kernel. The last kernel class lets you "fuse" custom kernels together, which can ease development by splitting the kernels into separate code.

The CuPy kernels come with low-level CUDA support that can substitute for CUDA functions. The general categories of these functions are:

  • device and memory management
  • memory hooks
  • streams and events
  • profilers

These functions are extremely useful for writing custom kernels. Please read the CuPy Python custom kernel documentation for details on these functions.


The element-wise kernel focuses on kernels that operate on an element-wise basis. You don't write the kernel in CUDA, but in Python, which greatly simplifies writing custom kernels. An element-wise kernel comprises four components:

  • input argument list (comma-separated)
  • output argument list (comma-separated)
  • loop body code
  • kernel name

Each argument comprises a type specifier and an argument name:

>>> kernel = cp.ElementwiseKernel(
...     'float32 x, float32 y', 'float32 z',
...     '''if (x - 2 > y) {
...       z = x * y;
...     } else {
...       z = x + y;
...     }''', 'my_kernel')

In the first line, the object instantiation is named kernel. The next line has the variables to be used as input (x and y) and output (z). These variables can be typed with NumPy data types, as shown. The function code then follows. The last line states the kernel name, which is my_kernel, in this case.

Notice that this code uses an if/else construct, which is allowed in custom kernels. The code itself is Python, which is fairly easy to write; the CuPy manual contains details on how to write the code.

Many times it would be nice to create a generic kernel that can handle multiple data types. CuPy custom kernels allow this with the use of a type placeholder:

kernel = cp.ElementwiseKernel(
  'T x, T y', 'T z',
  '''if (x - 2 > y) {
     z = x * y;
  } else {
     z = x + y;
  }''', 'my_kernel')

If you use type placeholders with the same character in the kernel definition, you can indicate the same type with the actual argument type, which in this case is T. The specific element-wise kernel first checks the output arguments followed by the input arguments to determine the types. If you don’t specify the output argument(s), then the input arguments are used to determine the data type. Additionally, you can use them in the functions themselves.

One of the great things about the ElementwiseKernel class, as well as other custom kernel classes, is that it handles all of the indexing and broadcasting. If you need or want to use manual indexing, you can do that as well with the raw keyword. You can use the special variable i, which indicates the index within the loop, and the method _ind.size(), which indicates the total number of elements to apply to the element-wise operations.


The second type of CuPy custom kernel is the reduction kernel, which is focused on kernels of the Map-Reduce type. The class behind it is the ReductionKernel class; it too, has four parts:

  • Identity value: Initial value of the reduction.
  • Mapping expression: Preprocesses each element to be reduced.
  • Reduction expression: An operator to reduce the multiple mapped values. Two special variables, a and b, are used for this operand.
  • Post-mapping expression: Transforms the reduced values. The special variable a is used as input. The output should be written to the output variable.

This function uses type placeholders for both the input and output.


The third CuPy custom kernel type allows you to create a kernel with CUDA and the RawKernel class. As with CUDA, it gives you control of the grid size, block size, shared memory size, and stream by calling the kernel with CUDA's cuLaunchKernel interface. Because the kernel uses CUDA code, you might want to access the kernel attributes by accessing the attributes dictionary (Python data type) or directly through the object's attributes. You can also set certain attributes by directly changing the object's attributes.

cupy.fuse Example

The cupy.fuse decorator can “fuse” custom kernel functions together. In some ways, it can look like a Numba decorator, so it can be easier to use than an ElementwiseKernel or a ReductionKernel. You can the fused function with scalars, NumPy arrays, and, of course, CuPy arrays.