Run One Program at any Scale with Legate


Porting NumPy code to other architectures, such as multinode (distributed) systems, requires a fair amount of programming knowledge. To get there, you will have to go beyond NumPy. What is really needed is a way to continue to use NumPy and its syntax but have the code run on GPUs and multinode systems. Moreover, because NumPy is an important part of many modules and Python tools, this arrangement would allow them to use GPUs and multinode architectures easily – which is the underlying theme of Legate.

The goal of Legate is to create a backend that transparently allows NumPy code to run on GPUs and multinode systems without code changes. In a way, it is a domain-specific language (DSL) focused on NumPy.
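
As a minimal sketch of what "without code changes" could look like in practice, the snippet below swaps only the import line and leaves the array code untouched. It assumes the Legate NumPy module is importable as legate.numpy (the name used in this article) and falls back to stock NumPy when it is not installed; the small computation is purely illustrative.

```python
# Try the Legate drop-in first; fall back to stock NumPy if it is missing.
try:
    import legate.numpy as np  # assumption: module name as used in this article
    backend = "legate"
except ImportError:
    import numpy as np
    backend = "numpy"

# Everything below is ordinary NumPy syntax and does not depend on the backend.
x = np.ones(1000)
y = 2.0 * x + 1.0          # element-wise arithmetic
total = float(y.sum())     # reduction over the whole array

print(f"backend={backend} total={total}")
```

The only backend-specific line is the import, which is the point of the approach: the rest of the program stays plain NumPy.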


Legate consists of several components, beginning with legate.core, which is built around a component based on Apache Arrow. Apache Arrow and the Legate equivalent “… define a … columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.” This format allows various software libraries to share in-memory data buffers without copying data around, which would kill performance.

Remember that Legate is targeting distributed systems of CPUs and GPUs. Consequently, the Legate implementation of Apache Arrow adds synchronization capability and data coherence so that distributed systems can take advantage of the performance that an in-memory data format offers. The GitHub page on legate.core goes into more detail.
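
The idea of sharing an in-memory column without copying can be illustrated with the Python standard library alone. This is not the Arrow or Legate API; it is a sketch in which one contiguous buffer stands in for a column and two hypothetical consumers hold views of it.

```python
from array import array

# One column of a table stored as a single contiguous buffer (columnar layout).
temperature = array("d", [20.5, 21.0, 19.8, 22.1])

# Two hypothetical consumers get views of the same buffer, not copies.
producer_view = memoryview(temperature)
consumer_view = memoryview(temperature)

# A write through one view is immediately visible through the other.
producer_view[2] = 25.0
visible_to_consumer = consumer_view[2]

# Slicing a view is also zero-copy: it refers to the same underlying memory.
window = consumer_view[1:3]

print(visible_to_consumer, window.tolist())
```

On a single machine this sharing comes for free. Across nodes and GPUs, something has to keep the copies coherent, which is the synchronization role described above.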

The Legate data model is built on top of Legion, a programming model for high-performance programs that run on distributed heterogeneous systems. Although Legion has a reputation for being difficult to use, code written with the model scales well on CPUs and GPUs. Additionally, Legion code has the reputation of being easy to port to new machines because, according to the Legate documentation, it decouples the machine-independent specification of computations from how the application is mapped to the target machine. For more detail on how Legate works with Legion, please refer to the GitHub page.

Because legate.core is the central component of legate.numpy, building legate.core is the first step and isn’t too difficult. The instructions on GitHub are pretty good. I used the Anaconda distribution of Python that has been installed into /usr/local/anaconda3 by root. The startup definitions for Anaconda were in my .bashrc file, so it ran just fine before building legate.core.

The laptop I used has the following properties:

  • CPU: Intel(R) Core(TM) i5-10300H CPU @2.50GHz
    • Processor base frequency: 2.5GHz
    • Max turbo frequency: 4.5GHz
    • Cache: 8MB
    • Four cores (eight threads with Hyper-Threading)
    • 45W thermal design power (TDP)
    • 8GB DDR4-2933 memory
    • Maximum of two memory channels
    • Memory bandwidth: 45.8GBps
  • Nvidia GeForce GTX 1650 GPU
    • Architecture: Turing (TU117)
    • Memory: 4GB GDDR5
    • Memory speed: 8Gbps
    • Memory bandwidth: 128GBps
    • Memory bus: 128-bit
    • L2 cache: 1MB
    • TDP: 75W
    • Base clock: 1,485MHz
    • Boost clock: 1,665MHz
    • 896 CUDA cores
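
As a quick sanity check on the GPU memory figures, peak bandwidth is roughly the per-pin data rate multiplied by the bus width, divided by 8 to convert bits to bytes. The sketch below assumes the published GDDR5 figures for this GPU, a per-pin rate of 8Gbps and a 128-bit bus; the function name is made up for illustration.

```python
def peak_bandwidth_gbytes(per_pin_gbps, bus_width_bits):
    """Theoretical peak memory bandwidth in GB/s.

    per_pin_gbps   -- data rate of one memory pin in Gbit/s
    bus_width_bits -- number of data pins (bus width in bits)
    """
    return per_pin_gbps * bus_width_bits / 8  # 8 bits per byte

# Assumed GDDR5 figures for the GeForce GTX 1650: 8 Gbit/s per pin, 128-bit bus.
gpu_bw = peak_bandwidth_gbytes(8, 128)
print(f"GPU peak memory bandwidth: {gpu_bw:.0f} GB/s")
```

The result agrees with the 128GBps memory bandwidth listed above.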

The laptop runs Ubuntu 20.04 with the 455.45.01 Nvidia driver and CUDA 11.2. All software was installed from the Apt repository for the specific distribution version.

The legate.core prerequisites are fairly simple:

  • PyArrow
  • NumPy
  • CUDA ≥ 8.0
  • C++-compatible compiler (g++ in this case)
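
Before starting the build, a quick check that these prerequisites are visible can save a failed run. The following is a hypothetical helper, not part of Legate: it assumes the Python modules are importable as pyarrow and numpy and that the CUDA and C++ compilers are on the PATH as nvcc and g++. It does not verify the CUDA version.

```python
import importlib.util
import shutil

def check_prerequisites():
    """Return a dict mapping each prerequisite name to True (found) or False."""
    modules = ("pyarrow", "numpy")   # Python packages, checked without importing
    tools = ("nvcc", "g++")          # executables looked up on the PATH
    found = {}
    for name in modules:
        found[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        found[name] = shutil.which(name) is not None
    return found

status = check_prerequisites()
for name, ok in status.items():
    print(f"{name:8s} {'found' if ok else 'MISSING'}")
```

Run it with the same Python interpreter and shell environment that will be used for the build, because both the module search path and the PATH can differ between users such as root and a regular account.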

Because I wanted to build legate.core to run on GPUs, I installed CUDA; then, I built legate.core as the root user with the command:

# ./ --cuda --with-cuda /usr/local/cuda-11 --arch volta --install-dir /usr/local/legate

Although I tried to use the CUDA installation that is part of Nvidia’s HPC SDK, I could never get the CUDA paths right, so I had to remove the HPC SDK paths from my environment while building Legate. Note that I used --arch volta because the GeForce GTX 1650 in my laptop is based on the Turing architecture, which is newer than Volta, and I installed Legate into /usr/local/legate.
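
One plausible reason a Volta build works on a Turing GPU is CUDA's binary compatibility rule: GPU binaries built for compute capability X.y run on devices with the same major version X and a minor version of at least y. Volta is compute capability 7.0 and Turing is 7.5. The sketch below encodes only that rule; the table and function names are illustrative and are not taken from the Legate build scripts.

```python
# Compute capabilities (major, minor) for the two architectures discussed here.
COMPUTE_CAPABILITY = {
    "volta": (7, 0),
    "turing": (7, 5),
}

def binary_runs_on(built_for, device):
    """True if a GPU binary built for `built_for` can run on `device`.

    Encodes the rule for native binaries: same major version, and the
    device's minor version must be at least that of the build target.
    """
    built_major, built_minor = COMPUTE_CAPABILITY[built_for]
    dev_major, dev_minor = COMPUTE_CAPABILITY[device]
    return built_major == dev_major and dev_minor >= built_minor

print(binary_runs_on("volta", "turing"))   # Volta build on a Turing GPU
print(binary_runs_on("turing", "volta"))   # Turing build on a Volta GPU
```

The reverse case fails because a Turing binary may use features that a Volta device does not have.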