Lead Image © Lucy Baldwin, 123RF.com

Finding your way around a GPU-accelerated cloud environment

Speed Racer

Article from ADMIN 63/2021
By Federico Lucifredi
We look at the tools needed to discover, configure, and monitor an accelerated cloud instance, employing the simplest possible tool to get the job done.

Raw compute horsepower has migrated from the central processing unit into dedicated chips over the last decade. Starting with specialized graphics processing units (GPUs), the trend has evolved into ever more specialized options for artificial intelligence, such as the tensor processing unit (TPU). Some emerging applications even make use of user-programmed field-programmable gate arrays (FPGAs) to execute customized in-silicon logic. These enhanced computing capabilities require adopting domain-specific data-parallel programming models, of which NVidia's CUDA [1] is the most widely used.

The rise of the cloud has made access to the latest hardware cost effective even for individual engineers, because coders can purchase time on accelerated cloud instances from Amazon Web Services (AWS), Microsoft Azure, Google, or Linode, to name but a few options. This month I look at the tools needed to discover, configure, and monitor an accelerated cloud instance in my trademark style, employing the simplest possible tool that will get the job done.

Knock, Knock. Who's There?

On logging in to an environment configured by someone else (or by yourself a few weeks prior), the first question you would pose is just what acceleration capabilities, if any, are available. This is quickly discovered with the command:

$ ec2metadata | grep instance-type
instance-type: p3.2xlarge

Variations of the ec2metadata tool query the AWS metadata service, helping you identify the instance's type. Alon Swartz's original ec2metadata [2] is found in Ubuntu releases like Bionic (18.04), on which the Deep Learning Amazon Machine Image (DLAMI) is currently based [3]. It has been replaced since by ec2-metadata [4] as Canonical decided to standardize on Amazon's implementation of the tool beginning with Groovy Gorilla (20.10).
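For scripting, the same check can be sketched in Python. The helper names (parse_metadata, instance_family) and the GPU_FAMILIES set are illustrative assumptions rather than part of either tool, and the set is deliberately incomplete, so verify it against current AWS documentation before relying on it.

```python
# Sketch: decide whether an instance is GPU-equipped from ec2metadata output.
# GPU_FAMILIES is an illustrative, incomplete assumption; check the AWS docs.
GPU_FAMILIES = {"p2", "p3", "p4d", "g3", "g4dn", "g5"}


def parse_metadata(text):
    """Turn the 'key: value' lines printed by ec2metadata into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that carry no colon
            info[key.strip()] = value.strip()
    return info


def instance_family(instance_type):
    """Return the family prefix, e.g. 'p3' for 'p3.2xlarge'."""
    return instance_type.split(".", 1)[0]


sample = "instance-type: p3.2xlarge"  # the line shown in the output above
family = instance_family(parse_metadata(sample)["instance-type"])
print(family, family in GPU_FAMILIES)
```

On a live instance, the sample string would be replaced by the text captured from the ec2metadata command, for example with subprocess.run(["ec2metadata"], capture_output=True, text=True).stdout.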

Documentation indicates this instance type is equipped with an NVidia Tesla V100 [5] datacenter accelerator (Figure 1). Built on the Volta microarchitecture [6], the V100 supports CUDA compute capability 7.0, and it was the first to ship tensor cores designed for superior machine learning performance over regular CUDA GPU cores. You can also find this out without resorting to references by interrogating the hardware with lspci (Figure 2); equivalent information can be obtained with lshw.

Figure 1: EC2Instances.info is an essential resource to query instance types.
Figure 2: Identifying the GPU hardware and its properties with lspci.
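If you want the lspci check in a script, a minimal Python sketch could filter the listing by device class. The find_gpus helper and the sample lines are illustrative assumptions: the sample only imitates the usual "slot class: description" layout, and the exact text on your instance may differ.

```python
import re

# Device classes under which lspci usually lists GPUs.
GPU_CLASSES = ("VGA compatible controller", "3D controller", "Display controller")


def find_gpus(lspci_text):
    """Return (slot, description) pairs for GPU-like devices in lspci output."""
    gpus = []
    for line in lspci_text.splitlines():
        # Expected layout: "<slot> <device class>: <vendor and device>"
        match = re.match(r"^(\S+)\s+([^:]+):\s+(.*)$", line)
        if match and match.group(2) in GPU_CLASSES:
            gpus.append((match.group(1), match.group(3)))
    return gpus


# Illustrative sample, not captured from the instance discussed in the article.
sample = (
    "00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)\n"
    "00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)\n"
)
for slot, description in find_gpus(sample):
    print(slot, description)
```

Matching on the device class instead of a vendor name keeps the sketch usable on machines with Intel or AMD graphics hardware as well.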

Stopwatch

A tidy and convenient utility to keep tabs on what is going on with the GPU is called gpustat [7]. Load information and memory utilization can be sourced alongside temperature, power, and fan speed. An updating watch [8] view is also present. After installing from the Python package repository (pip3 install gpustat), try the following:

$ gpustat -P
[0] Tesla V100-SXM2-16GB | 37'C, 0 %, 24 / 300 W | 0 / 16160 MB |

One GPU is present, running at a cool 37 degrees Celsius and drawing 24W while doing absolutely nothing. To proceed further, you need to find a load generator, because trusted standbys stress [9] and stress-ng [10] do not yet supply GPU stressors. Obvious choices include the glmark2 [11] and glxgears [12] graphics load generators. Both tools measure a frame-rate benchmark and require a valid X11 display. A headless alternative is supplied by the password recovery utility hashcat [13], which includes a built-in GPU-accelerated hashing benchmark. Version 6 adds a CUDA backend and can be found on the developer's website. Launching the nightmare workload profile will keep the GPU busy for some time, giving you a few minutes to test tools (Figure 3). Try it with:

$ sudo ./hashcat-6.1.1/hashcat.bin -b -O -w 4

Figure 3: Hashcat's benchmark mode is a great self-contained GPU load generator.

Figure 4 shows the results with gpustat. Temperature is exceeding 55 degrees, and power consumption is approaching 230W; 2.5GB of memory is in use, and GPU load is now at 100%. At the same time, I took the opportunity to call on nvidia-smi [14], the NVidia systems management interface utility, for an alternative view of the system's status. The nvidia-smi utility is the official GPU-configuration tool supplied by the vendor. Available on all supported Linux distributions, it encompasses all recent NVidia hardware. (See "Intel and AMD Radeon GPU Tools" box.)
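The same readings can be collected in a script through the query mode of nvidia-smi. The sketch below is one possible approach, not the vendor's recommended one: parse_query is a hypothetical helper, the sample row is made up to resemble the load described above, and the float conversion assumes every field is numeric (a field reported as unsupported would need extra handling).

```python
import csv
import io

# Properties requested from nvidia-smi; --query-gpu accepts a comma-separated list.
FIELDS = ["temperature.gpu", "power.draw", "utilization.gpu", "memory.used"]

# Command to run on the instance; nounits strips the "W", "%", and "MiB" suffixes.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(FIELDS),
    "--format=csv,noheader,nounits",
]


def parse_query(csv_text):
    """Map each GPU's CSV row onto FIELDS, converting the values to float."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(FIELDS, map(float, row))) for row in rows if row]


# Made-up row for illustration: one line per GPU, in FIELDS order.
sample = "56, 229.50, 100, 2510\n"
for gpu in parse_query(sample):
    print(gpu)
```

On a machine with the NVidia driver installed, the text to parse would come from subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True).stdout.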

Intel and AMD Radeon GPU Tools

Users of hardware not manufactured by NVidia need not fear; Linux tools exist for their GPUs as well. The intel-gpu-tools package supplies the intel_gpu_top [15] command, which will produce a process and load listing (but alas no curses chart) on machines equipped with Intel hardware. For AMD chips, the radeontop [16] command provided by the eponymous package will do the trick – and it provides an interesting take on terminal graphics, showcasing loads in different parts of the rendering pipeline.

Another interesting bit of software coming out of Intel is the oneAPI Toolkit, which stands out for its ability to bridge execution across CPUs, GPUs, and even FPGAs with a single data-parallel abstraction [17].

Figure 4: Warming up a corner of EC2's us-east-1 datacenter with a benchmark.

The Real McCoy

This tour must inevitably end with a top-like tool. Maxime Schmitt's nvtop [18] is packaged for the most popular Linux distributions; on Ubuntu, it is found in the universe repository starting with Focal (20.04), but it is easily compiled from source on the 18.04-based DLAMI: I was able to do so without incident in a few minutes. The tool can handle multiple GPUs and produces an intuitive in-terminal plot. Conveniently, it can distinguish between graphics and compute workloads in its process listing, and it plots the load on each GPU alongside the use of GPU memory. The intermittent nature of Hashcat's many-part benchmark is shown clearly in a test (Figure 5).

Figure 5: nvtop is my choice for the most comprehensive GPU monitoring tool.

One last, excellent option comes from AWS itself in the form of the CloudWatch service. CloudWatch does not track GPU metrics by default, but the DLAMI documentation provides instructions on how to configure and authorize a simple Python script that reports temperature, power consumption, GPU utilization, and GPU memory usage to the cloud service [19]. The results are great (Figure 6), and the data is stored in the service that you should already be using to monitor your cloud instances, making a case for convenience and integration. You can customize the granularity of the sampling by modifying the supplied script. Please take note of a minor inconsistency in the documentation: The store_resolution variable is really named store_reso.

Figure 6: Putting the monitoring data where it belongs: in the CloudWatch service.
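To give an idea of what such a reporting script does, here is a minimal sketch of building and sending custom metrics with boto3. It is not the script from the DLAMI documentation: the namespace, dimension names, metric names, and instance ID are hypothetical placeholders, and publish assumes boto3 is installed and the instance role is allowed to call PutMetricData.

```python
def build_metric_data(instance_id, gpu_index, readings, resolution=60):
    """Build the MetricData list for CloudWatch's PutMetricData call.

    readings maps a metric name to a (value, unit) pair; resolution is the
    storage resolution in seconds (CloudWatch accepts 1 or 60).
    """
    dimensions = [
        {"Name": "InstanceId", "Value": instance_id},  # hypothetical dimension names
        {"Name": "GPUNumber", "Value": str(gpu_index)},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Unit": unit,
            "StorageResolution": resolution,
            "Value": value,
        }
        for name, (value, unit) in readings.items()
    ]


def publish(metric_data, namespace="GPUMonitoring", region="us-east-1"):
    """Send the metrics; the namespace and region are placeholder assumptions."""
    import boto3  # third-party; imported here so the builder runs without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)


# Made-up readings; "None" is used where CloudWatch offers no matching unit.
readings = {
    "GPU Usage": (100, "Percent"),
    "Temperature": (56, "None"),
    "Power Usage (Watts)": (229.5, "None"),
}
data = build_metric_data("i-0123456789abcdef0", 0, readings)
print(len(data), "metrics ready")
```

Keeping the payload construction separate from the network call makes it easy to inspect what would be sent before granting the instance any CloudWatch permissions.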

Infos

  1. CUDA: https://developer.nvidia.com/cuda-zone
  2. Alon Swartz – ec2metadata: https://www.turnkeylinux.org/blog/amazon-ec2-metadata
  3. DLAMI: https://docs.aws.amazon.com/dlami/latest/devguide/what-is-dlami.html
  4. ec2-metadata: http://manpages.ubuntu.com/manpages/groovy/en/man8/ec2-metadata.8.html
  5. NVidia V100 Tensor Core GPU: https://www.nvidia.com/en-us/data-center/v100/
  6. NVidia Volta microarchitecture: https://en.wikipedia.org/wiki/Volta_(microarchitecture)
  7. Jongwook Choi – gpustat: https://pypi.org/project/gpustat/
  8. watch (1) man page: http://manpages.ubuntu.com/manpages/bionic/en/man1/watch.1.html
  9. Amos Waterland – stress (1) man page: http://manpages.ubuntu.com/manpages/bionic/man1/stress.1.html
  10. Colin King – stress-ng (1) man page: http://manpages.ubuntu.com/manpages/bionic/man1/stress-ng.1.html
  11. glmark2 (1) man page: http://manpages.ubuntu.com/manpages/bionic/en/man1/glmark2.1.html
  12. glxgears (1) man page: http://manpages.ubuntu.com/manpages/bionic/en/man1/glxgears.1.html
  13. Jens Steube – hashcat v6: https://hashcat.net/hashcat/
  14. nvidia-smi: https://developer.nvidia.com/nvidia-system-management-interface
  15. intel_gpu_top (1) man page: http://manpages.ubuntu.com/manpages/bionic/en/man1/intel_gpu_top.1.html
  16. radeontop: https://github.com/clbr/radeontop
  17. Intel oneAPI Toolkits: https://software.intel.com/content/www/us/en/develop/tools/oneapi/all-toolkits.html
  18. Maxime Schmitt – nvtop: https://github.com/Syllo/nvtop
  19. GPU monitoring with CloudWatch: https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-gpu-monitoring-gpumon.html

The Author

Federico Lucifredi (@0xf2) is the Product Management Director for Ceph Storage at Red Hat and was formerly the Ubuntu Server Product Manager at Canonical and the Linux "Systems Management Czar" at SUSE. He enjoys arcane hardware issues and shell-scripting mysteries and takes his McFlurry shaken, not stirred. You can read more from him in the new O'Reilly title AWS System Administration.
