High-Performance Python – GPUs


One of the first tools for running GPU code from Python was PyCUDA, which gives you access to the CUDA API from within Python. This tool requires you to learn CUDA in addition to Python. You can see some code samples in the PyCUDA documentation. Note that you can leave the data on the GPU and operate on it with another kernel.

Simple PyCUDA Example

A simple PyCUDA example from a few years ago is shown in Listing 4. This code takes an input array (float32) and doubles the values. A CUDA kernel is defined and compiled when the SourceModule is created; the get_function() method of the mod object then returns a handle to the compiled kernel. The function is called, and the data is copied back to the host with the memcpy_dtoh function.

Listing 4: PyCUDA Example

import numpy
import pycuda.autoinit  # initializes the CUDA driver
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Create a float32 input array and copy it to the GPU
a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

# Kernel that doubles each element in place
mod = SourceModule("""
__global__ void twice(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
func = mod.get_function("twice")
func(a_gpu, block=(4, 4, 1))

# Copy the result back to the host
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)


Initially, to code GPUs, you had to learn CUDA. CUDA is a C variant with many extra functions and qualifiers, so you could adequately take advantage of GPU performance. At the time, using C as the base language for programming GPUs was very logical, because many people, including scientists, wrote code in C. However, developers didn't want to learn C and CUDA just to create functions for Python.

Some developers knew C and quickly learned CUDA, so they could write tools and libraries for Python. For several years, Python GPU tools were the wild west: each stood on its own, built to the needs of its developer, and none worked with the others. Over time, thanks to projects like GOAI and RAPIDS, the tools have slowly come together into an ecosystem: each component stands on its own, yet plays well with the others. CuPy can interact and share data with RAPIDS, and Numba can interoperate with both tools.
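Much of this interoperability rests on the __cuda_array_interface__ protocol, a dictionary that lets one library hand another a pointer to GPU memory along with its shape and data type. The sketch below shows the shape of that dictionary; the FakeDeviceArray wrapper is purely illustrative and, so that it runs anywhere, publishes a host NumPy buffer's address where a real library (CuPy, Numba, RAPIDS) would publish a device pointer.

import numpy as np

class FakeDeviceArray:
    """Illustrative wrapper exposing __cuda_array_interface__.

    Real GPU libraries publish a *device* pointer here; this sketch
    uses a host NumPy buffer only to show the dictionary layout.
    """
    def __init__(self, arr):
        self._arr = np.ascontiguousarray(arr)

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": self._arr.shape,
            "typestr": self._arr.dtype.str,          # e.g. '<f4' for float32
            "data": (self._arr.ctypes.data, False),  # (pointer, read_only)
            "version": 2,
        }

wrapped = FakeDeviceArray(np.zeros((4, 4), dtype=np.float32))
iface = wrapped.__cuda_array_interface__
print(iface["shape"], iface["typestr"])  # e.g. (4, 4) <f4

Because consumers only read this dictionary, any object that exposes it can be passed to a cooperating library without copying the underlying buffer.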

Very soon, people will say that Python and GPUs go together like peas and carrots. Today, the tools interoperate, allowing code to be easily written in Python, or for Python, using CUDA. Given how quickly the community has gone from zero interoperability to its present state, don't blink.