Improving performance with environment variables

Trick or No Trick

NVBLAS

NVidia provides several libraries you can use when writing programs. Some of these libraries, such as cuBLAS [9], are standards-conforming. NVidia has built on cuBLAS to create NVBLAS, a "drop-in" replacement BLAS library that provides BLAS Level 3 routines [10]. Both cuBLAS and NVBLAS are included as part of CUDA [11], so you get them by following the directions for downloading and installing CUDA. For this article, I used the cuBLAS and NVBLAS that came with the NVidia HPC SDK, version 21.3.

Before using NVBLAS, you have to configure it. From the NVBLAS documentation [12], "It must be configured through an ASCII text file that describes how many and which GPUs can participate in the intercepted BLAS calls." To use NVBLAS, create the file nvblas.conf in the directory in which you are running the scripts. For the example in this article, the contents of the file I used were:

# This is the configuration file to use NVBLAS Library
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so.0
NVBLAS_GPU_LIST 0
NVBLAS_AUTOPIN_MEM_ENABLED

Lines that begin with # are comments. The NVBLAS_LOGFILE line names the file to which NVBLAS writes any log information. The NVBLAS_CPU_BLAS_LIB line names the CPU-only BLAS library that NVBLAS falls back on whenever a call is not handled on the GPU. In this case, I chose to use the OpenBLAS library.

The NVBLAS_GPU_LIST line lists the GPU devices that should be used, with numbering that begins at 0. The laptop has only one NVidia GPU, so only device 0 is listed. You can also use the keyword ALL to select all the GPUs in the system. The last line, NVBLAS_AUTOPIN_MEM_ENABLED, is something I took from an article about NVBLAS with Octave [13].

After configuring nvblas.conf, you have to take two steps to run Octave. The first step is to export the NVBLAS_CONFIG_FILE environment variable, which points to the location of the nvblas.conf file:

export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf
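A wrong path in either the environment variable or the configuration file is easy to miss, so a quick sanity check before launching Octave can help. The following is only a minimal sketch: the function name check_nvblas_conf is hypothetical (it is not part of NVBLAS or CUDA), and it assumes the NVBLAS_CPU_BLAS_LIB entry is a full path, as in the file shown earlier.

```shell
# check_nvblas_conf: hypothetical helper, not part of NVBLAS or CUDA.
# Usage: check_nvblas_conf [config-file]
# Without an argument, it falls back to $NVBLAS_CONFIG_FILE.
check_nvblas_conf() {
    conf="${1:-${NVBLAS_CONFIG_FILE:-}}"

    # The configuration file itself must exist and be readable.
    if [ -z "$conf" ] || [ ! -r "$conf" ]; then
        echo "config file missing or unreadable: ${conf:-<unset>}" >&2
        return 1
    fi

    # Take the value that follows the NVBLAS_CPU_BLAS_LIB keyword.
    # Comment lines start with "#", so their first field never matches.
    cpu_lib=$(awk '$1 == "NVBLAS_CPU_BLAS_LIB" { print $2; exit }' "$conf")
    if [ -z "$cpu_lib" ]; then
        echo "no NVBLAS_CPU_BLAS_LIB entry in $conf" >&2
        return 1
    fi

    # Assumption: the entry is a full path, so it can be tested directly.
    if [ ! -e "$cpu_lib" ]; then
        echo "CPU BLAS library not found: $cpu_lib" >&2
        return 1
    fi

    echo "OK: $conf uses CPU BLAS library $cpu_lib"
}
```

After the export, running check_nvblas_conf with no arguments checks the file that NVBLAS_CONFIG_FILE names.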

NVBLAS_CONFIG_FILE simply tells NVBLAS where to find the ASCII configuration file you created. The second step is the run command itself, which uses the LD_PRELOAD trick to load NVBLAS first:

LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026 octave-cli ./sgemm.m

The command begins by setting LD_PRELOAD to the path of the NVBLAS library and then runs Octave (octave-cli) with the script as its argument. To run the script, you can simply put the export and the run command together (I tend to write a one-line Bash script for this). The results for the single- and double-precision scripts are shown in Table 4.
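The two steps can also be wrapped in a small shell function. The following is a minimal sketch, an illustration rather than the exact script used for this article: the function name run_with_nvblas and the NVBLAS_LIB variable are hypothetical, and the default library path is the one from the command above, which will differ on other installations.

```shell
# Hypothetical wrapper around the LD_PRELOAD command shown above.
# NVBLAS_LIB is an illustrative variable name; override it if your
# installation keeps libnvblas somewhere else.
NVBLAS_LIB="${NVBLAS_LIB:-/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026}"

run_with_nvblas() {
    # NVBLAS needs its configuration file, so fail early if it is missing.
    if [ ! -r "${NVBLAS_CONFIG_FILE:-}" ]; then
        echo "NVBLAS_CONFIG_FILE is unset or not readable" >&2
        return 1
    fi

    # Fail early if the library to preload does not exist.
    if [ ! -e "$NVBLAS_LIB" ]; then
        echo "NVBLAS library not found: $NVBLAS_LIB" >&2
        return 1
    fi

    # Both variables are set for this one command only, so the rest of
    # the shell session is not affected by the preload.
    NVBLAS_CONFIG_FILE="$NVBLAS_CONFIG_FILE" LD_PRELOAD="$NVBLAS_LIB" "$@"
}

# Example use:
#   export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf
#   run_with_nvblas octave-cli ./sgemm.m
```

Setting LD_PRELOAD on the command line of a single command, rather than exporting it, keeps the preload from leaking into every other program started from the same shell.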

Table 4

Octave Results with the NVBLAS Library

	Single-Precision, GPU	Double-Precision, GPU
N	Elapsed Time (secs)	GFLOPS	Elapsed Time (secs)	GFLOPS
2	0.001167	0.000014	0.001007	0.000016
4	0.000076	0.001678	0.000069	0.001864
8	0.000061	0.016777	0.000061	0.016777
16	0.000061	0.134218	0.000069	0.119305
32	0.000076	0.858993	0.000076	0.858993
64	0.000099	5.286114	0.000145	3.616815
128	0.000542	7.74304	0.000603	6.958934
256	0.000549	61.083979	0.001152	29.126136
512	0.016685	16.087962	0.012955	20.721067
1,024	0.008904	241.195353	0.039238	54.72975
2,048	0.01741	986.765913	0.250496	68.583432
4,096	0.093765	1465.776933	1.500099	91.619911
8,192	0.643051	1709.835418	12.03125	91.387979

I cannot explain the strange blip in the results at N = 512, but it happens very frequently. Note that similar odd results at N = 256 and N = 512 also appeared when using the CPU.

For the CPU results, double-precision performance is about half of single-precision performance, which is expected. On the GPU, however, double-precision performance is far less than half of single-precision performance: At N = 8,192, it is about 91 GFLOPS versus about 1,710 GFLOPS. The reason is that the GPU used (the GeForce GTX 1650) is a consumer-grade GPU that focuses primarily on 32-bit performance. As you can tell, it can run double-precision code, just not as well as the data center GPUs that focus on 64-bit performance.
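The GFLOPS columns in Table 4 can be checked by hand. The values are consistent with counting 2N^3 floating-point operations for one N x N matrix multiply and dividing by the elapsed time. That operation count is inferred from the numbers in the table and is an assumption here, as is the helper name gflops. A minimal sketch using awk:

```shell
# gflops: hypothetical helper that recomputes a Table 4 entry.
# Assumes 2*N^3 floating-point operations per N x N matrix multiply.
# $1 = matrix dimension N, $2 = elapsed time in seconds
gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 2 * n^3 / t / 1e9 }'
}

gflops 8192 0.643051    # single precision, prints 1709.8
gflops 8192 12.03125    # double precision, prints 91.4
```

Dividing the two results gives a ratio of about 18.7, which is the gap between the single- and double-precision columns at N = 8,192.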

Summary

The LD_PRELOAD trick is something of a rite of passage for new system administrators. When they find out about the trick, it is a revelation because of how flexible it can be. Soon, it is no longer a trick but a part of what the admin uses every day. I hope the simple example in this article, which used LD_PRELOAD to run computations on a GPU without any code changes, illustrates its utility.

If you knew of this trick but have forgotten it, or if you are just learning it, I hope this article proved useful.

Infos

[1] PATH: http://www.linfo.org/path_env_var.html
[2] Shared objects: https://man7.org/linux/man-pages/man8/ld.so.8.html
[3] BLAS: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
[4] Octave: https://www.gnu.org/software/octave/index
[5] Matlab: https://www.mathworks.com/
[6] i5-10300H CPU: https://ark.intel.com/content/www/us/en/ark/products/201839/intel-core-i5-10300h-processor-8m-cache-up-to-4-50-ghz.html
[7] NVidia GeForce GTX 1650 GPU: https://www.nvidia.com/en-us/geforce/graphics-cards/gtx-1650/
[8] OpenBLAS: https://en.wikipedia.org/wiki/OpenBLAS
[9] cuBLAS: https://developer.nvidia.com/cublas
[10] BLAS Level 3 routines: https://docs.nvidia.com/cuda/nvblas/index.html
[11] CUDA: https://developer.nvidia.com/cuda-toolkit
[12] NVBLAS documentation: https://docs.nvidia.com/cuda/nvblas/index.html#configuration-file
[13] NVBLAS with Octave: https://developer.nvidia.com/blog/drop-in-acceleration-gnu-octave/

The Author

Jeff Layton has been in the HPC business for almost 25 years (starting when he was 4 years old). He can be found lounging around at a nearby Fry's enjoying the coffee and waiting for sales.
