Run One Program at any Scale with Legate

Matrix Multiplication

The test code is a matrix multiplication with the NumPy dot function. The regular NumPy Python code is:

import numpy as np

import time

nx = 5000
ny = 5000

start = time.perf_counter()

a = np.random.rand(nx,ny)
b = np.random.rand(nx,ny)

c = np.dot(a,b)

stop = time.perf_counter()

print("Elapsed time = ",(stop-start)," secs")
print(" ")

The Legate version of the code simply changes the importcommand to:

import legate.numpy as np

This simple change allows the code to take advantage of legate.numpy.

The first test is to run the code with Anaconda Python3:

$ python3 test1.py
Elapsed time =  2.157353051999962  secs

Out of curiosity, I ran the “plain” NumPy code with Legate:

$ legate test1.py
[0 - 7f080ae4e000]    0.000070 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
Elapsed time =  2.1458820959996956  secs

The performance is virtually the same as running the code with regular Python and NumPy.

Next, I ran the Legate version of the code:

$ legate test1_legate.py
[0 - 7f488c010000]    0.000075 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
Elapsed time =  0.023377846999665053  secs

Note the code is named test1_legate.py to indicate that the correct importcommand was used. The performance with Legate is about 100 times that of regular Python and NumPy.

Finally, I can try running the code with Legate and the GPU:

$ legate --gpus 1 --fbmem 3000 test1_legate.py
[0 - 7fb3326dd000]    0.735097 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
Elapsed time =  0.01964395799996055  secs

The first option in the command is to tell Legate to use one GPU in the system (there is only one GPU).

The default GPU framebuffer in Legate is 4GB. I would always get an error when running with the default framebuffer because the GPU in my laptop only has 4GB of memory. The Legate developers suggested I tell Legate to use a framebuffer of 3GB (i.e., the option --fbmem 3000).

Running the code on the GPU is even faster than on the CPU, despite having to move the data back and forth to the GPU.

Summary

Legate allows writing Python code with well-known modules and then running them on distributed heterogeneous systems, which is something of the Holy Grail of coding – writing single-process code and then running it on large-scale systems with GPUs. Even better, Legate is open source.

Legate is used for building equivalent Python modules. The first two are NumPy and Pandas, perhaps two of the most popular Python modules. Legate and legate.numpy are very easy to build by following the GitHub instructions.

I ran a simple matrix multiplication example with NumPy, and I only had to change one line – the module importcommand – to run the code with Legate. The base NumPy code used a single core, but the CPU Legate version used four cores. I also ran the code on a GPU to illustrate how useful Legate can be in porting NumPy code to distributed heterogeneous systems.

Legate is still being developed, so at this time you might not find many NumPy functions fully implemented. Many of the basic mathematical functions have been implemented, including some basic array operations such as matrix multiplication, as shown in the example; however, none of the linear algebra functions in NumPy have been implemented yet. Perhaps contributing code is something you might be willing to do.

Keep an eye on Legate, especially if you use NumPy or Pandas modules: It has the potential to be a game changer.