Run One Program at Any Scale with Legate

Run Python NumPy code on distributed heterogeneous systems without changing a single line of code.

Python is one of the most used languages today, particularly in machine learning. Python itself doesn’t have a numerically focused array data type, so NumPy was created to fill that void, adding support for large, multidimensional arrays and matrices, along with a good collection of math functions that operate on them.

NumPy

Most people who code in Python have seen or written NumPy code, but just in case you have not, this quick example creates an “empty” 2D array of size nx by ny:

import numpy as np

nx = 10
ny = 10

a = np.empty((nx,ny))
type(a)

Array a is of data type numpy.ndarray, with which you can do all sorts of things, including adding, subtracting, multiplying, and performing other mathematical operations. Another example is to solve the linear system Ax = b:

import numpy as np

nx = 100
ny = 100

a = np.random.rand(nx,ny)
b = np.random.rand(ny)

x = np.linalg.solve(a, b)

The matrix a and the right-hand-side vector b are filled by a random number generator with samples drawn from a uniform distribution over [0,1). The system is then solved by the solve routine.
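
It is worth a quick sanity check that the computed x actually satisfies the system: multiplying a by x should reproduce b to within floating-point tolerance. The check below is my addition, not part of the original example:

import numpy as np

nx = 100

a = np.random.rand(nx,nx)
b = np.random.rand(nx)

x = np.linalg.solve(a, b)

# a.dot(x) should match b up to rounding error for a well-conditioned system
print(np.allclose(np.dot(a, x), b))   # expect True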

NumPy on GPUs

NumPy functions run single threaded unless the libraries underneath NumPy (e.g., the BLAS used for linear algebra) are multithreaded. GPUs are monsters for matrix computations, but NumPy itself does not run on GPUs. CuPy was created as an almost complete NumPy-compatible library that runs on GPUs. Most CuPy routines run on a single GPU; making a function use multiple GPUs requires extra work. CuPy is even adding functions from SciPy to its codebase. As with NumPy, CuPy is open source.

CuPy code looks virtually identical to NumPy code:

import cupy as cp

nx = 10
ny = 10

a = cp.empty((nx,ny))
type(a)

Notice that the CuPy array type (cupy.ndarray) is different from NumPy’s because the data lives on the GPU. However, CuPy provides functions that move data back and forth between host and device, converting between the two array types for you.
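
For example, cp.asarray() copies a NumPy array to the GPU and cp.asnumpy() copies a CuPy array back to the host. The round trip below is just a small illustration of that mechanism:

import numpy as np
import cupy as cp

a_host = np.random.rand(10,10)    # numpy.ndarray in CPU memory
a_gpu = cp.asarray(a_host)        # copy to the GPU; now a cupy.ndarray
b_gpu = 2.0*a_gpu                 # computed on the GPU
b_host = cp.asnumpy(b_gpu)        # copy the result back to a numpy.ndarray

print(type(a_gpu), type(b_host))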

As mentioned previously, you can run almost any NumPy code with CuPy. The second NumPy example is written as CuPy:

import cupy as cp

nx = 100
ny = 100

a = cp.random.rand(nx,ny)
b = cp.random.rand(ny)

x = cp.linalg.solve(a, b)

To run NumPy code on GPUs with CuPy, you need to change your code. Generally, it is not difficult to port most NumPy code to CuPy. Mixing NumPy and CuPy requires extra coding to move data back and forth while paying attention to data types and data location, which increases the amount of code and adds complexity. Moreover, without careful coding, the code would no longer be portable because it needs a CPU and a GPU.
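
One way to keep mixed CPU/GPU code manageable, which the CuPy documentation suggests, is to write functions against whichever array module owns the input. The sketch below assumes CuPy is installed and is only meant to illustrate the pattern:

import numpy as np
import cupy as cp

def l2_norm(x):
    # cp.get_array_module returns numpy or cupy depending on where x lives,
    # so the same function works for both host and device arrays
    xp = cp.get_array_module(x)
    return xp.sqrt(xp.sum(x*x))

print(l2_norm(np.random.rand(100)))   # runs on the CPU
print(l2_norm(cp.random.rand(100)))   # runs on the GPU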

Legate

Porting NumPy code to other architectures, such as multinode (distributed) systems, requires a fair amount of programming knowledge. To get there, you will have to go beyond NumPy. What is really needed is a way to continue to use NumPy and its syntax but have the code run on GPUs and multinode systems. Moreover, because NumPy is an important part of many modules and Python tools, this arrangement would allow them to use GPUs and multinode architectures easily – which is the underlying theme of Legate.

The goal of Legate is to create a backend that transparently allows NumPy code to run on GPUs and multinode systems without code changes. In a way, it is a domain-specific language (DSL) focused on NumPy.

legate.core

Legate is built from several components, beginning with legate.core, which is built around a component based on Apache Arrow. Apache Arrow and the Legate equivalent “… define a … columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.” It allows various software libraries to share in-memory data buffers without having to copy data around (which kills performance).
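
To make the zero-copy idea concrete, plain PyArrow (one of the legate.core prerequisites listed below) can wrap a NumPy buffer without duplicating it. This sketch only illustrates the Arrow concept and says nothing about how Legate uses Arrow internally:

import numpy as np
import pyarrow as pa

a = np.arange(10, dtype=np.float64)

# Wrap the NumPy buffer as an Arrow tensor
t = pa.Tensor.from_numpy(a)

# Convert back; Arrow documents this as a zero-copy operation for contiguous data
b = t.to_numpy()

# If both conversions are zero-copy, a and b share the same buffer,
# so this prints 42.0
b[0] = 42.0
print(a[0])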

Remember that Legate is targeting distributed systems of CPUs and GPUs. Consequently, the Legate implementation of Apache Arrow adds synchronization capability and data coherence so that distributed systems can take advantage of the performance that an in-memory data format offers. The GitHub page on legate.core goes into more detail.

The Legate data model is built on top of Legion, a programming model for high-performance programs that run on distributed heterogeneous systems. Although Legion has a reputation for being difficult to use, code written with the model scales well on CPUs and GPUs. Additionally, Legion code has a reputation for being easy to port to new machines because, according to the Legate documentation, it decouples the machine-independent specification of computations from how the application is mapped to the target machine. For more detail on how Legate works with Legion, please refer to the GitHub page.

Because legate.core is the central component of legate.numpy, building legate.core is the first step, and it isn’t too difficult. The instructions on GitHub are pretty good. I used the Anaconda distribution of Python, which had been installed into /usr/local/anaconda3 by root. The startup definitions for Anaconda were already in my .bashrc file, so Python ran just fine before I built legate.core.

The laptop I used has the following properties:

  • CPU: Intel(R) Core(TM) i5-10300H CPU @2.50GHz
    • Processor base frequency: 2.5GHz
    • Max turbo frequency: 4.5GHz
    • Cache: 8MB
    • Four cores (eight threads with hyperthreading)
    • 45W thermal design power (TDP)
    • 8GB DDR4-2933 memory
    • Maximum of two memory channels
    • Memory bandwidth: 45.8GBps
  • Nvidia GeForce GTX 1650 GPU
    • Architecture: Turing (TU117)
    • Memory: 4GB GDDR5
    • Memory speed: 8Gbps
    • Memory bandwidth: 128GBps
    • Memory bus: 128-bit
    • L2 cache: 1MB
    • TDP: 75W
    • Base clock: 1,485MHz
    • Boost clock: 1,665MHz
    • 896 CUDA cores

The laptop runs Ubuntu 20.04 with the 455.45.01 Nvidia driver and CUDA 11.2. All software was installed from the Apt repository for the specific distribution version.

The legate.core prerequisites are fairly simple:

  • PyArrow
  • NumPy
  • CUDA ≥ 8.0
  • C++-compatible compiler (g++ in this case)
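
Before running the build, it doesn’t hurt to confirm that the Python-side prerequisites are importable in the environment you plan to build against. This is just a convenience check, not part of the official instructions:

# Quick sanity check of the Python-side prerequisites
import numpy
import pyarrow

print("NumPy:", numpy.__version__)
print("PyArrow:", pyarrow.__version__)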

Because I wanted to build legate.core to run on GPUs, I installed CUDA; then, I built legate.core as the root user with the command:

# ./install.py --cuda --with-cuda /usr/local/cuda-11 --arch volta --install-dir /usr/local/legate

Although I had tried to use the CUDA that was installed as part of Nvidia’s HPC SDK, I could never get the proper paths for CUDA, so I had to remove the paths to the HPC SDK while building Legate. Note that I used --arch volta even though the GeForce GTX 1650 in my laptop is based on the Turing architecture; Turing is newer than Volta, so code built for Volta still runs on it. Also note that I installed Legate into /usr/local/.

legate.numpy

The Legate core provides the building blocks for other Legate Python modules. The main one is NumPy on top of Legate, or legate.numpy, which implements some NumPy functions with the goal of someday implementing all of them. At the time of this writing, it supports the following data types: float16, float32, float64, int16, int32, int64, uint16, uint32, uint64, bool, complex64, and complex128. Legate currently only works on up to three-dimensional arrays but in time should support n-dimensional arrays.
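
Because legate.numpy aims to be a drop-in replacement, selecting one of the supported data types should look exactly as it does in NumPy. The snippet below is an assumption based on that compatibility goal rather than something taken from the Legate documentation:

import legate.numpy as np

# Arrays with explicitly supported data types (and no more than three dimensions)
a = np.zeros((100,100), dtype=np.float32)
b = np.ones((100,100), dtype=np.complex64)

print(a.dtype, b.dtype)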

Building legate.numpy is not difficult. As the root user, use the command,

# python setup.py --with-core /usr/local/legate

which installs legate.numpy where legate.core is located.

A Legate module for Pandas, legate.pandas, is being developed, but I didn’t build or test it for this article, although it should not be difficult to build or use.

Testing legate.numpy

Getting started with Legate is very easy. You do not run your Python code with python3 code.py; rather, you use legate code.py. For your NumPy code to run under Legate, it needs to import legate.numpy first:

import legate.numpy as np

In default mode, Legate runs on just a single node. As I built it, Legate defaults to four cores (the system has four cores, eight with hyperthreading turned on), with GPU support available.

Matrix Multiplication

The test code is a matrix multiplication with the NumPy dot function. The regular NumPy Python code is:

import numpy as np

import time

nx = 5000
ny = 5000

start = time.perf_counter()

a = np.random.rand(nx,ny)
b = np.random.rand(nx,ny)

c = np.dot(a,b)

stop = time.perf_counter()

print("Elapsed time = ",(stop-start)," secs")
print(" ")

The Legate version of the code simply changes the import command to:

import legate.numpy as np

This simple change allows the code to take advantage of legate.numpy.

The first test is to run the code with Anaconda Python3:

$ python3 test1.py
Elapsed time =  2.157353051999962  secs

Out of curiosity, I ran the “plain” NumPy code with Legate:

$ legate test1.py
[0 - 7f080ae4e000]    0.000070 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
Elapsed time =  2.1458820959996956  secs

The performance is virtually the same as running the code with regular Python and NumPy.

Next, I ran the Legate version of the code:

$ legate test1_legate.py
[0 - 7f488c010000]    0.000075 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
Elapsed time =  0.023377846999665053  secs

Note the code is named test1_legate.py to indicate that the correct import command was used. The performance with Legate is about 100 times that of regular Python and NumPy.

Finally, I tried running the code with Legate and the GPU:

$ legate --gpus 1 --fbmem 3000 test1_legate.py
[0 - 7fb3326dd000]    0.735097 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
Elapsed time =  0.01964395799996055  secs

The first option in the command is to tell Legate to use one GPU in the system (there is only one GPU).

The default GPU framebuffer size in Legate is 4GB. I would always get an error when running with the default framebuffer because the GPU in my laptop has only 4GB of memory. The Legate developers suggested I tell Legate to use a 3GB framebuffer (i.e., the option --fbmem 3000, specified in megabytes).

Running the code on the GPU is even faster than on the CPU, despite having to move the data back and forth to the GPU.

Summary

Legate allows you to write Python code with well-known modules and then run it on distributed heterogeneous systems, which is something of the Holy Grail of coding: writing single-process code and then running it on large-scale systems with GPUs. Even better, Legate is open source.

Legate is used to build drop-in equivalents of popular Python modules. The first two are NumPy and Pandas, perhaps the two most popular. Legate and legate.numpy are very easy to build by following the GitHub instructions.

I ran a simple matrix multiplication example with NumPy, and I only had to change one line – the module import command – to run the code with Legate. The base NumPy code used a single core, but the CPU Legate version used four cores. I also ran the code on a GPU to illustrate how useful Legate can be in porting NumPy code to distributed heterogeneous systems.

Legate is still being developed, so at this time you might not find many NumPy functions fully implemented. Many of the basic mathematical functions have been implemented, including some basic array operations such as matrix multiplication, as shown in the example; however, none of the linear algebra functions in NumPy have been implemented yet. Perhaps contributing code is something you might be willing to do.

Keep an eye on Legate, especially if you use NumPy or Pandas modules: It has the potential to be a game changer.

Tags: CPU, NumPy, Python