Improving performance with environment variables

Trick or No Trick

NVBLAS

NVidia offers several libraries you can use when writing programs. Some of them, such as cuBLAS [9], are standards-conforming. NVidia has used cuBLAS as the basis for NVBLAS, a "drop-in" replacement BLAS library that provides BLAS level 3 routines [10]. Both cuBLAS and NVBLAS are included as part of CUDA [11]; simply follow the directions for downloading and installing CUDA. For this article, I used the cuBLAS and NVBLAS that came with the NVidia HPC SDK, version 21.3.

Before using NVBLAS, you have to configure it. From the NVBLAS documentation [12], "It must be configured through an ASCII text file that describes how many and which GPUs can participate in the intercepted BLAS calls." To use NVBLAS, create the file nvblas.conf in the directory in which you are running the scripts. For the example in this article, the contents of the file I used were:

# This is the configuration file to use NVBLAS Library
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so.0
NVBLAS_GPU_LIST 0
NVBLAS_AUTOPIN_MEM_ENABLED

The first line after the comment defines the logfile where NVBLAS writes any log information. The next line names the CPU-only BLAS library, specified with the NVBLAS_CPU_BLAS_LIB variable, to which NVBLAS falls through when no GPU routine is available. In this case, I chose to use the OpenBLAS library.

The third line lists the GPU devices that should be used, with numbering beginning at 0. In this case, the laptop has only one NVidia GPU, so only one is listed. You can also use the keyword ALL to select all the GPUs in the system. The last line, which enables the NVBLAS pinned-memory mode, is something I took from an article about NVBLAS with Octave [13]. After configuring nvblas.conf, you have to take two steps to run Octave. The first step is to export the NVBLAS_CONFIG_FILE environment variable, which points to the location of the nvblas.conf file:

export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf
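If you switch between machines or GPU selections, the configuration step can be scripted. The following is only a minimal sketch: write_nvblas_conf is a hypothetical helper name (not part of CUDA or NVBLAS), and it assumes you want the same five lines shown in the listing above, with the CPU BLAS path and GPU list passed as arguments.

```shell
#!/usr/bin/env bash
# Sketch: generate nvblas.conf from parameters instead of editing it by hand.
# write_nvblas_conf is a hypothetical helper, not part of CUDA or NVBLAS.
write_nvblas_conf() {
    local dir="$1"          # directory that will hold nvblas.conf
    local cpu_blas="$2"     # path to the CPU fallback BLAS library
    local gpus="${3:-ALL}"  # a device number such as 0, or the keyword ALL

    cat > "${dir}/nvblas.conf" <<EOF
# This is the configuration file to use NVBLAS Library
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB ${cpu_blas}
NVBLAS_GPU_LIST ${gpus}
NVBLAS_AUTOPIN_MEM_ENABLED
EOF

    # Print the path of the file so the caller can export it.
    printf '%s\n' "${dir}/nvblas.conf"
}
```

Calling write_nvblas_conf "$HOME/PROJECTS/OCTAVE" /usr/lib/x86_64-linux-gnu/libopenblas.so.0 0 should reproduce the file used in this article, and the path it prints is the value to export as NVBLAS_CONFIG_FILE.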

The NVBLAS_CONFIG_FILE environment variable simply points to the ASCII configuration file you created. The second step is the run command itself, which uses the LD_PRELOAD trick to load NVBLAS first:

LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026 octave-cli ./sgemm.m

The command begins by setting LD_PRELOAD to the path of the NVBLAS library, followed by the command that runs Octave (octave-cli). To run the script, you can simply concatenate the two commands (I tend to write a one-line Bash script for this). The results for the single- and double-precision scripts are shown in Table 4.
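A slightly longer version of such a wrapper might look like the following. It is a sketch under stated assumptions, not the script used for this article: run_with_nvblas is a hypothetical function name, and the two readability checks are additions meant to catch a mistyped library or configuration path before Octave starts.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper around the LD_PRELOAD run command.
# run_with_nvblas is a hypothetical helper name, not an NVidia tool.
run_with_nvblas() {
    local lib="$1"   # full path to the NVBLAS shared library
    local conf="$2"  # full path to nvblas.conf
    shift 2          # everything left is the command to run

    if [ ! -r "$lib" ]; then
        echo "NVBLAS library not readable: $lib" >&2
        return 1
    fi
    if [ ! -r "$conf" ]; then
        echo "NVBLAS config not readable: $conf" >&2
        return 1
    fi

    # Both variables are set for this one command only, so other programs
    # started from the same shell keep using their usual BLAS.
    NVBLAS_CONFIG_FILE="$conf" LD_PRELOAD="$lib" "$@"
}
```

With the paths from this article, the call would be run_with_nvblas followed by the libnvblas.so path shown above, $HOME/PROJECTS/OCTAVE/nvblas.conf, and then octave-cli ./sgemm.m. Setting the variables on the command line rather than exporting them keeps the preload from leaking into unrelated commands.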

Table 4

Octave Results with the NVBLAS Library

       Single-Precision, GPU              Double-Precision, GPU
N      Elapsed Time (secs)  GFLOPS        Elapsed Time (secs)  GFLOPS
2      0.001167             0.000014      0.001007             0.000016
4      0.000076             0.001678      0.000069             0.001864
8      0.000061             0.016777      0.000061             0.016777
16     0.000061             0.134218      0.000069             0.119305
32     0.000076             0.858993      0.000076             0.858993
64     0.000099             5.286114      0.000145             3.616815
128    0.000542             7.74304       0.000603             6.958934
256    0.000549             61.083979     0.001152             29.126136
512    0.016685             16.087962     0.012955             20.721067
1,024  0.008904             241.195353    0.039238             54.72975
2,048  0.01741              986.765913    0.250496             68.583432
4,096  0.093765             1465.776933   1.500099             91.619911
8,192  0.643051             1709.835418   12.03125             91.387979

I cannot explain the strange "blurp" in the results for N = 512, but it happens very frequently. Notice that similarly strange results at N = 256 and N = 512 also appeared when using the CPU.

For the CPU results, double-precision performance is about half of single-precision performance, which is expected. On the GPU, however, double-precision performance is far less than half of single-precision performance (about 5 percent at N = 8,192), because the GPU used (the GeForce 1650) is a consumer-grade GPU focused primarily on 32-bit performance. As you can tell, it can run double-precision code, just not as well as the data center GPUs that focus on 64-bit performance.
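The gap can be checked directly from the numbers in Table 4. The sketch below assumes the GFLOPS column uses the usual count of 2N^3 floating-point operations for multiplying two N x N matrices, an assumption that appears consistent with the table entries; gflops and speed_ratio are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: recompute GFLOPS from N and elapsed time, assuming the usual
# 2*N^3 operation count for multiplying two N x N matrices.
# gflops and speed_ratio are hypothetical helper names.
gflops() {
    # $1 = matrix dimension N, $2 = elapsed time in seconds
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 2 * n * n * n / t / 1e9 }'
}

speed_ratio() {
    # $1 = single-precision GFLOPS, $2 = double-precision GFLOPS
    awk -v sp="$1" -v dp="$2" 'BEGIN { printf "%.1f\n", sp / dp }'
}

# Elapsed times taken from the N = 8,192 row of Table 4
sp=$(gflops 8192 0.643051)
dp=$(gflops 8192 12.03125)
echo "single: ${sp} GFLOPS, double: ${dp} GFLOPS, ratio: $(speed_ratio "$sp" "$dp")"
```

For the largest matrix, the ratio comes out at close to 19 to 1 in favor of single precision, compared with the roughly 2 to 1 ratio expected on the CPU.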

Summary

The LD_PRELOAD trick is something of a rite of passage for new system administrators. When they find out about the trick, it is a revelation because of how flexible it can be. Soon, it is no longer a trick but a part of what the admin uses every day. I hope the simple example in this article, which used LD_PRELOAD to move computation onto a GPU without any code changes, illustrates its utility.

If you knew of this trick but have forgotten it, or if you are just learning it, I hope this article proved useful.

The Author

Jeff Layton has been in the HPC business for almost 25 years (starting when he was 4 years old). He can be found lounging around at a nearby Fry's enjoying the coffee and waiting for sales.
