OpenCL

We take a close look at how to integrate graphics processors into your parallel programs with OpenCL.

OpenCL is an open standard for programming parallel environments. In this article, I look at how to get started with OpenCL. The example described here shows how to use OpenCL for parallel computations on the graphics processor (GPU), but this is just one sample of the many powers of OpenCL.

In the past few months, more and more high-powered computers have leveraged their video cards to secure a place in the list of Top 500 supercomputing sites. It is possible to invoke the GPU directly to assist with generic application processing, but the data has to be parallelizable. In today’s machines, quad- and hexa-core processors have become the norm. Each processor core can handle a different task at any given time. For example, while core 1 is accessing RAM, core 2 might be adding the numbers in two registers.

A graphics card, on the other hand, comprises several hundred cores, but they are not as versatile. Calculations performed by the GPU use SIMD/SIMT [1], which means the same instruction is executed at the same time on all the active cores but applied to different data. This restriction imposes a limitation on the underlying problem, but programmers can still speed up certain types of applications using the GPU and OpenCL programming techniques.

On the GPU

Just a few years ago, video cards were designed solely for accelerating specialized graphics applications. To perform generic calculations on them, programmers had to delve deep into their box of tricks and take a roundabout route to achieve their aims. For example, a scientist who wanted to add two large matrices could interpret them as textures with OpenGL, position them one in front of the other, and make the front half transparent. The rendered result is a texture that can then be interpreted as the resulting matrix. Although this might sound like a fairly simple addition, you’ll soon hit limits if you start to multiply or introduce loops, jumps, or conditions. At the beginning of 2007, NVidia took the first step toward providing some structure to this collection of tricks with its CUDA parallel computing architecture [2].

The CUDA architecture gives programmers an interface for solving arbitrary problems with parallelizable data without taking the roundabout approach that I mentioned. Also, C for CUDA gives developers an easy option for parallel computing with the video card. However, CUDA is restricted to NVidia cards, and you will need to use a special compiler (namely, the NVCC in the case of C for CUDA), which makes integrating CUDA with existing projects more difficult.

Faced with this problem, Apple took the initiative and (along with AMD, IBM, and NVidia) submitted a specification to the Khronos Group, which promotes the OpenGL specification. Open Computing Language 1.0 (OpenCL) was released late in 2008 [3].

OpenCL now provides a cross-platform standard for creating heterogeneous, parallel programs for CPUs, GPUs, and other processors. The presence of the OpenCL standard makes it possible to perform generic calculations on the GPU independently of the platform and manufacturer (see the “System Requirements” box).

OpenCL offers exciting opportunities for software developers who want to accelerate their applications on contemporary PCs. However, do keep in mind that the video card isn’t the answer for everything. 

A video card–based solution works best for problems that involve serious amounts of computation and don’t rely on double precision. Also, running a problem on the GPU involves some overhead and only pays dividends for problems that exceed a certain input size.

Installing OpenCL

The OpenCL specification lets individual manufacturers (AMD, NVidia, Intel, etc.) offer implementations that are slightly different. What you mainly get from the manufacturers is a library and the matching header file. 

Besides the original specification for C, C++ bindings have been included by default since OpenCL 1.1. These bindings facilitate the handling of OpenCL data structures and help keep the code shorter. Because the C++ bindings wrap the C functions inline, the process doesn’t involve any computational overhead.
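
To give you a first taste of how much shorter the bindings make things, the following snippet queries all available OpenCL platforms; the equivalent code with the C API (clGetPlatformIDs()) requires two calls plus manual buffer management:

  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);  // fills the vector with all available platforms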

If you have a supported AMD/ATI video card, or if you want to use your CPU, you must install the ATI Stream SDK [6]. After downloading and unpacking the ATI Stream SDK, it is also a good idea to read the guide. Basically, you extend the $LD_LIBRARY_PATH variable to include the directory containing the libOpenCL.so library. A user on the AMD forum also kindly provided a .deb package for Ubuntu, which automatically installs the library and header files in the right system folders [7]. In this case, you do not need to change $LD_LIBRARY_PATH.

Users of NVidia video cards will need the CUDA Toolkit [8]. After downloading and unpacking the file, become root to install the toolkit. The standard path for the cl.h header file is /usr/local/cuda/include; the library belongs in /usr/local/cuda/lib or in /usr/local/cuda/lib64 if you have a 64-bit system.

You will need to tell the linker the path to the library by adding it to your $LD_LIBRARY_PATH. Unfortunately, the current toolkit (3.2) only contains the OpenCL 1.0 libraries, not the C++ bindings that I referred to previously. Registered developers can download the pre-release driver for OpenCL 1.1 [9].
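
For example, on a 64-bit system, a line like the following (adjust the path to match your installation) makes the library visible to the linker:

  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH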

As a temporary workaround, you can add the C++ bindings header file (cl.hpp) to the system-wide include folder as /usr/local/include/CL/cl.hpp or copy it into your own project. You can download the header file from the Khronos website [10].

Getting Started

To demonstrate OpenCL in action, I will start with a simple example to solve a parallelizable problem with OpenCL on a video card. The starting point is a C++ implementation that convolves a grayscale image. 

Convolution [11] is a mathematical operation performed on two functions to produce a third function. In this case, the components consist of an input image, a second matrix known as the convolution kernel, and an output image. Applications for this technique include image manipulations, such as Gaussian blurs, sharpening, motion blur, and gradients. 
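
Expressed as a simple formula (using the same indexing scheme as the listings that follow), each output pixel is a weighted sum of its neighborhood in the input image:

  out(x, y) = sum over all (kx, ky) of in(x + kx, y + ky) * kernel(kx, ky)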

Figure 2 shows examples of various convolutions based on the input image shown in Figure 1.

Figure 1: The input image: Larry Ewing's Tux.

Figure 2: Convolutions of Tux.

The first step is to store the grayscale image in a simple structure that holds the width, the height, and the pixel data as a uchar array (Listing 1, lines 1-5). The *data field concatenates the rows of the image. The following image:

A B C

D E F

would thus be represented as width = 3, height = 2, data = {A, B, C, D, E, F}; pixel (x, y) is therefore found at data[y * width + x]. The convolution kernel is similarly structured but uses float instead of uchar, as you can see in line 10 of the listing. The convolution process doesn’t require much code, as you can see in line 26 of Listing 1.

Listing 1: CPU Solution
 01 typedef struct {
 02     uint width;
 03     uint height;
 04     uchar *data;
 05 } grayImage;
 06 
 07 typedef struct {
 08     uint width;
 09     uint height;
 10     float *data;
 11 } convolutionKernel;
 12 
 13 
 14 /**
 15  * Clamp value to limits of uchar.
 16  */
 17 uchar clampuchar(int value)
 18 {
 19   return (uchar) std::min((int) std::numeric_limits<uchar>::max(), std::max(value, (int) std::numeric_limits<uchar>::min()));
 20 }
 21 
 22 /**
 23  * Convolve a grayscale image with a convolution 
 24  * kernel on the CPU.
 25  */
 26 grayImage convolve(grayImage in, convolutionKernel kernel)
 27 {
 28   grayImage out;
 29   out.width = in.width - (kernel.width - 1);
 30   out.height = in.height - (kernel.height - 1);
 31   out.data = new uchar[out.height * out.width];
 32 
 33   // Iterate over all pixels of the output image
 34   for(size_t y = 0; y < out.height; ++y)
 35   {
 36     for(size_t x = 0; x < out.width; ++x)
 37     {
 38        float convolutionSum = 0.0f;
 39 
 40         // Iterate over convolution kernel
 41         for(size_t ky = 0; ky < kernel.height; ++ky)
 42         {
 43             for(size_t kx = 0; kx < kernel.width; ++kx)
 44             {
 45                // Add weighted value to convolution sum
 46                convolutionSum += in.data[(y + ky) * in.width + (x + kx)] * kernel.data[ky * kernel.width + kx];
 47             }
 48         }
 49         // Clamp values to {0, ..., 255} and store them
 50         out.data[y * out.width + x] = clampuchar((int) convolutionSum);
 51     }
 52   }
 53   return out;
 54 }

You probably immediately noticed that the convolution process involves four nested for loops. The two outer loops in lines 34 and 36 iterate over all the pixels in the target image. The two inner loops in lines 41 and 43 iterate over all kernel positions and weight the pixels of the input image with the kernel values. The weighted values are added up and written to pixel (x, y) of the output image (lines 46, 50). Note that the output image is smaller than the input image by kernel.width - 1 and kernel.height - 1 pixels. To maintain the size of the input image, you would need to extend the input image outward at its borders, but handling these border cases would compromise the clarity of the example.

The clampuchar(int value) function (line 50) prevents values outside the uchar range ({0, …, 255}) from under- or overflowing; instead, they are mapped to the extremes (0, 255). For example, clampuchar(260) == 255, clampuchar(-5) == 0, and clampuchar(14) == 14.

Parallelizable?

If you look at the structure of the convolution, you will see that exactly the same instructions are performed for each pixel in the output image, but for neighboring data. 

If you want to execute the convolution kernel in parallel for each pixel of the output image, the pseudocode would look something like Listing 2. Basically, you just replace the two outer loops with a fictive “do in parallel … .”

Listing 2: Pseudocode for Execution in Parallel
 01 // Do in parallel
 02 for all y in {0, ..., out.height - 1} and all x in {0, ..., out.width - 1}
 03 {
 04     float convolutionSum = 0.0f;
 05 
 06     // Iterate over convolution kernel
 07     for(size_t ky = 0; ky < kernel.height; ++ky)
 08     {
 09       for(size_t kx = 0; kx < kernel.width; ++kx)
 10 
 11       {
 12         // Add weighted value to convolution sum
 13         convolutionSum += in.data[(y + ky) * in.width + (x + kx)] * kernel.data[ky * kernel.width + kx];
 14       }
 15     }
 16     // Clamp values to {0, ..., 255} and store them
 17     out.data[y * out.width + x] = clampuchar((int) convolutionSum);
 18 }

The Framework for GPU Applications

When you create a regular C++ program for execution on the CPU, the data is stored in the RAM and the CPU registers. If you want to use the GPU to process data, you first must feed the data to the GPU. To do this, you must copy the data you want to process (across the bus) from RAM to video memory. The results are retrieved in the same way but in the opposite order.

The video card runs a thread on each core. The threads all run the same kernel function (not to be confused with the convolution kernel described above), but with a different index for each thread. All threads that are active at the same time perform precisely the same instruction at a given point in time. The kernel is a piece of C source code that tells each thread what it has to do (depending on its index).

Depending on the scale of the problem, you will need to define the number of threads (in this case, the number of pixels in the output image) and the data (in this case, the input image, output image, convolution kernel) and then start the kernel.

Firing Up the GPU

Listing 3 shows the OpenCL code for the convolution example.

Listing 3: OpenCL Code
 001 #define __CL_ENABLE_EXCEPTIONS
 002 #include "convolve.hpp"
 003 #include "timer.hpp"
 004 
 005 #include "CL/cl.hpp"
 006 #include <limits>   // uchar max, min
 007 #include <cassert>  // assert
 008 #include <cstring>  // strlen
 009 
 010 /**
 011  * The OpenCL kernel for image convolution.
 012  */
 013 const char* kernelSource = "\
 014   __kernel void convolveKernel(\
 015     global uchar *in,\
 016     uint inWidth,\
 017     uint inHeight,\
 018     global uint *out,\
 019     uint outWidth,\
 020     uint outHeight,\
 021     global float *convKernel,\
 022     uint convKernelWidth,\
 023     uint convKernelHeight)\
 024 {\
 025   size_t x = get_global_id(0);\
 026   size_t y = get_global_id(1);\
 027   \
 028   /* Kill unneeded threads */\
 029   if(x >= outWidth || y >= outHeight)\
 030   {\
 031     return;\
 032   }\
 033   \
 034   float convolutionSum = 0.0f;\
 035   for(size_t ky = 0; ky < convKernelHeight; ++ky)\
 036   {\
 037     for(size_t kx = 0; kx < convKernelWidth; ++kx)\
 038         {\
 039           convolutionSum += (float) in[(y + ky) * inWidth + (x + kx)] * convKernel[ky * convKernelWidth + kx];\
 040         }\
 041   }\
 042   out[y * outWidth + x] = (uint) clamp(convolutionSum, 0.0f, 255.0f);\
 043 }";
 044 
 045 /**
 046  * Convolve a grayscale image with a convolution kernel on the GPU using OpenCL.
 047  */
 048 grayImage convolveGPU(grayImage in, convolutionKernel convKernel)
 049 {
 050   grayImage out;
 051   out.width = in.width - (convKernel.width - 1);
 052   out.height = in.height - (convKernel.height - 1);
 053   out.data = new uchar[out.height * out.width];
 054 
 055   // Platforms
 056   std::vector< cl::Platform > platforms;
 057   cl::Platform::get(&platforms);
 058   assert(platforms.size() > 0);
 059 
 060   // Devices
 061   std::vector<cl::Device> devices;
 062   platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
 063   assert(devices.size() > 0);
 064   assert(devices[0].getInfo<CL_DEVICE_TYPE>() == CL_DEVICE_TYPE_GPU);
 065 
 066   // Context
 067   cl::Context context(devices);
 068 
 069   // Create GPU buffers
 070   cl::Buffer inGPU(context, CL_MEM_READ_ONLY, in.width * in.height * sizeof(uchar));
 071   cl::Buffer convKernelGPU(context, CL_MEM_READ_ONLY, convKernel.width * convKernel.height * sizeof(float));
 072   cl::Buffer outGPU(context, CL_MEM_WRITE_ONLY, out.width * out.height * sizeof(uint));
 073 
 074   // Commandqueue
 075   cl::CommandQueue queue(context, devices[0], 0);
 076 
 077   // Upload in.data to inGPU
 078   queue.enqueueWriteBuffer(inGPU, false /* non-blocking; FIFO */, 0, in.width * in.height * sizeof(uchar), in.data);
 079 
 080   // Upload kernel.data to convKernelGPU
 081   queue.enqueueWriteBuffer(convKernelGPU, true /* blocking for correct timing */, 0, convKernel.width * convKernel.height * sizeof(float), convKernel.data);
 082 
 083   // Program
 084   cl::Program::Sources source(1, std::make_pair(kernelSource, strlen(kernelSource)));
 085 
 086   cl::Program program(context, source);
 087   program.build(devices);
 088 
 089   // Ranges
 090   size_t localWidth = 16;
 091   size_t localHeight = 16;
 092 
 093   cl::NDRange localRange(localWidth, localHeight);
 094   cl::NDRange globalRange(((out.width - 1) / localWidth + 1) * localWidth, ((out.height - 1) / localHeight + 1) * localHeight);
 095 
 096   // Run kernel
 097   cl::Kernel kernel(program, "convolveKernel");
 098   cl::KernelFunctor func = kernel.bind(queue, globalRange, localRange);
 099 
 100   cl::Event event = func(inGPU, in.width, in.height, outGPU, out.width, out.height, convKernelGPU, convKernel.width, convKernel.height);
 101   event.wait();
 102 
 103   // Download result
 104   uint *outTemp = new uint[out.width * out.height];
 105   queue.enqueueReadBuffer(outGPU, true, 0, out.width * out.height * sizeof(uint), outTemp);
 106 
 107   // Convert uint array to uchar array
 108   for(size_t i = 0; i < out.width * out.height; ++i)
 109   {
 110     out.data[i] = (uchar) outTemp[i];
 111   }
 112 
 113   delete[] outTemp;
 114   return out;
 115 }

The OpenCL C++ bindings are included with #include <CL/cl.hpp>. To use the exceptions in the bindings rather than normal C error codes, I need to #define __CL_ENABLE_EXCEPTIONS. All of the classes are located in the cl:: namespace. To tell the linker what to link against, I add an -lOpenCL argument to the g++ parameter list.
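
A complete build command might then look like the following line (a sketch; the source file name is a placeholder, and -lpng anticipates the PNG handling described later):

  g++ -o convolucl convolucl.cpp -lOpenCL -lpng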

In contrast to CUDA, OpenCL doesn’t create platform-dependent code until run time. This means that OpenCL first has to discover the hardware that will be running the parallel code. To allow this to happen, I create a vector of cl::Platform objects and a vector of cl::Device objects (Listing 3, lines 56, 61).

Note that cl::Platforms come in two profiles: “full profile” and “embedded profile.” In this article, I will be looking at programming with the full profile type.

Within each cl::Platform, multiple cl::Devices can exist. A cl::Device stands for a GPU or a CPU. The assert() calls in lines 63 and 64 make sure that at least one device supports OpenCL and that the first device is a supported video card. To execute your program on the CPU, you have to take a device of type CL_DEVICE_TYPE_CPU from the vector.
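
For instance, a hypothetical CPU variant of the device query would look like this:

  // Ask the first platform for CPU devices instead of GPUs
  std::vector<cl::Device> cpuDevices;
  platforms[0].getDevices(CL_DEVICE_TYPE_CPU, &cpuDevices);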

A cl::Context manages objects such as command queues, memory objects, kernels, and execution objects across multiple cl::Devices. I will look at these objects in more detail in the following sections. In this case, the cl::Context only manages the video card.

I still need to define a cl::CommandQueue. This is where action objects are queued; in the normal case, they are executed in first-in, first-out (FIFO) order.

Copying Data

To be able to store data (input image, convolution kernel, output image) in video memory, I first need to create cl::Buffer objects for the input image, output image, and convolution kernel (lines 70-72). 

Next, I pass the managing context, the access mode, and the size to each buffer’s constructor. This is much like a malloc() in the GPU RAM. For a cl::Buffer, the access modes are CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, and CL_MEM_READ_WRITE; these modes refer to access by the video card. The host always has full access to the memory objects. For each memory object, I need to give the size in bytes to allocate, in typical malloc() style. If the GPU RAM is not big enough, an exception is thrown.

The next step is to copy the input data from the host to the cl::Buffer objects. To copy the input data, I line up copy actions in the command queue using enqueueWriteBuffer() (lines 78 and 81).

The parameters that I need to pass in to the command are the target buffer, whether or not this is a blocking action (in other words, execution doesn’t return to the host until the action has been completed), an offset, the size to copy in bytes, and a pointer to the host memory address.

The blocking parameter for the first copy action is set to false. The action is lined up in the command queue, and the host regains control of the execution flow (while the copy action runs in the background) and can thus set up the second copy action. This second action is blocking to prevent the host from carrying on work until the second copy action has been completed.

Because the command queue uses the FIFO approach, both copy actions are completed after this, which allows for correct timing of I/O actions. If no dependencies exist, or only linear dependencies between the actions, you can use non-blocking actions for everything apart from the last action that you put in the queue.

The Kernel

I need to tell each thread on the video card what to do, depending on its index value. The kernel source code is defined as a const char* and named kernelSource (line 13). I use the __kernel keyword to tell the OpenCL run-time compiler that this is an OpenCL kernel.

OpenCL C is based on a subset of ISO C99, so its syntax and semantics will feel familiar. In addition, OpenCL provides a number of useful built-in functions [13].

Three parameters in the parameter list are tagged with the global keyword. These parameters identify pointers in the global video memory; the pointers reference the cl::Buffers. Using the get_global_id() built-in function, a unique index is assigned to each thread.

The parameter 0 is for the x dimension along the output image (line 25), and 1 stands for the y dimension. The if instruction (line 29) terminates any threads that might have been started unnecessarily.

The main task for each thread is basically taken directly from the CPU implementation or from the pseudocode intermediate step (Listing 2). 

I only need to port the two inner loops (Listing 2, lines 7 and 9) because the threads process all the required x and y values in parallel. I use the built-in clamp() function (Listing 3, line 42) to clamp the convolution sum before storing it; it behaves much like clampuchar().

Attentive readers will have noticed that the cl::Buffer out stores unsigned integers rather than unsigned chars (Listing 3, line 72). The reason is that some video card models don’t support stores to arbitrary byte-aligned addresses [14]. If you use addresses that are integer-aligned (every 4 bytes), you are definitely on the safe side.

If you want to use arbitrary addresses, you must enable the OpenCL cl_khr_byte_addressable_store extension with a pragma. You can issue a getInfo() call for the cl::Device to determine whether your GPU supports this feature.
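
A minimal sketch of both steps (my own example, assuming <string> is included and the device reports its extensions): the host checks the extension string, and the pragma goes at the top of the kernel source:

  // Host side: check the device's extension string
  std::string ext = devices[0].getInfo<CL_DEVICE_EXTENSIONS>();
  bool byteStores = ext.find("cl_khr_byte_addressable_store") != std::string::npos;

  // Kernel side: first line of the kernel source
  // #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable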

Compiling and Calling the Kernel

The kernel source code must be compiled at run time by the OpenCL library; this is what makes the program platform independent. I create a cl::Program object from the source code and call program.build() to compile it (line 87).
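
If the run-time compilation fails, OpenCL stores the compiler messages in a build log. A minimal sketch of how to retrieve it (not part of Listing 3; assumes <iostream> and the exceptions enabled by __CL_ENABLE_EXCEPTIONS):

  try {
      program.build(devices);
  } catch (cl::Error &e) {
      // Print the OpenCL compiler output for the first device, then bail out
      std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]) << std::endl;
      throw;
  }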

In addition to a unique global index, each thread also has a local index because of the hardware layout. All the threads in a work group share a small but very fast memory area. The non-optimized example described here does not use this memory; instead, it simply uses a generic two-dimensional work group size of 16x16 (lines 90 and 91).

I need the total number of threads in each dimension to call the kernel. This number must be at least as large as the output image; however, it also must be divisible by the work group size in every dimension. For example, with an output width of 100 pixels and a work group width of 16, the global range in the x dimension is rounded up to 112 (line 94).

I can now load the previously compiled kernel from cl::Program and pass in the numbers I just calculated. Finally, I need to start the kernel by passing the parameters to the cl::​KernelFunctor. The parameters are the input image buffer, output image buffer, convolution kernel buffer, and corresponding metadata (size) (line 100).

Although it is necessary to create and populate the buffers in advance, simple data types like the image sizes can be passed in directly (i.e., without using enqueueWriteBuffer()). A kernel call is always non-blocking. You can use a cl::Event with a .wait() to wait for the call to complete.

The threads from the kernel call write their results to the outGPU buffer. The results are copied from this buffer to host memory using enqueueReadBuffer().

Sample Code

To keep things simple, link the code against the free libpng library [15]. You can use this library to read and write grayscale PNG files.

After entering make all to build, you can use the convolution kernels to process any grayscale PNG. The command syntax is convolucl <input.png> <output.png> [kernel index]. If you call convolucl without any parameters, it gives you a list of available kernels:

  0 Sobel
  1 Gauss 5x5
  2 Gauss 12x12
  3 Mean 3x3
  4 Mean 5x5
  5 Emboss
  6 Sharpen
  7 Motion blur

Figure 2 shows some of the results of these convolutions.

Summary and Conclusions

The GPU doesn’t guarantee a shorter execution time. On the one hand, there is the overhead of just-in-time compilation of the OpenCL kernel; on the other, the data first must be copied to GPU RAM, which takes additional time. For special cases (large convolution kernels) and large volumes of data, you can still save time without even considering optimization strategies.

Far larger speed boosts are possible if you optimize the kernel functions. The threads in a work group share local memory, which is orders of magnitude faster than the global GPU RAM. In the naive kernel used here, the convolution kernel matrix elements are retrieved from global memory on every access. If, instead, the elements were loaded once per work group into local memory, it would be possible to leverage the video card’s potential more efficiently.
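
A sketch of this optimization (my own variant, not part of the sample code): every work group first stages the convolution kernel in local memory, and all threads then read the weights from there. The host would pass the buffer size as an extra kernel argument via cl::__local(convKernel.width * convKernel.height * sizeof(float)).

  __kernel void convolveKernelLocal(
      global uchar *in, uint inWidth, uint inHeight,
      global uint *out, uint outWidth, uint outHeight,
      global float *convKernel, uint convKernelWidth, uint convKernelHeight,
      local float *convLocal)
  {
      /* Cooperatively copy the convolution kernel into fast local memory */
      size_t lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
      size_t groupSize = get_local_size(0) * get_local_size(1);
      for (size_t i = lid; i < convKernelWidth * convKernelHeight; i += groupSize)
          convLocal[i] = convKernel[i];
      /* Every thread in the group must reach the barrier, so it comes before the early return */
      barrier(CLK_LOCAL_MEM_FENCE);

      size_t x = get_global_id(0);
      size_t y = get_global_id(1);
      if (x >= outWidth || y >= outHeight)
          return;

      float convolutionSum = 0.0f;
      for (size_t ky = 0; ky < convKernelHeight; ++ky)
          for (size_t kx = 0; kx < convKernelWidth; ++kx)
              convolutionSum += (float) in[(y + ky) * inWidth + (x + kx)]
                                * convLocal[ky * convKernelWidth + kx];
      out[y * outWidth + x] = (uint) clamp(convolutionSum, 0.0f, 255.0f);
  }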

Additionally, the image convolution offers further potential for optimization if you restrict the problem to separable kernels. However, I purposely omitted improvements of this kind to keep the problem simple and provide an easier entry into OpenCL. At the same time, you can view this article as a guide that will help you solve problems by running portions of your programs on the video card.

For more in-depth information, I recommend the NVidia OpenCL Programming Guide [13], which investigates the video card’s hardware architecture, as well as the sample code in the ATI and NVidia SDKs. OpenCL developers will not want to be without the OpenCL specification [16] and the documentation for the C++ bindings [17].

Info

 [1] Wikipedia SIMD:
[http://en.wikipedia.org/wiki/SIMD]
 [2] NVidia CUDA overview:
[http://www.nvidia.com/object/what_is_cuda_new.html]
 [3] Official OpenCL website:
[http://www.khronos.org/registry/cl/]
 [4] AMD/ATI system requirements, driver compatibility:
[http://developer.amd.com/gpu/ATIStreamSDK/pages/DriverCompatibility.aspx]
 [5] Supported NVidia GPUs:
[http://www.nvidia.com/object/cuda_gpus.html]
 [6] ATI Stream SDK download:
[http://developer.amd.com/gpu/ATIStreamSDK/downloads/]
 [7] ATI Stream SDK DEB package:
[http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=125792]
 [8] CUDA toolkit download:
[http://developer.nvidia.com/object/cuda_3_2_downloads.html#Linux]
 [9] NVidia pre-release drivers:
[http://developer.nvidia.com/object/opencl.html]
 [10] OpenCL 1.1 C++ bindings header file:
[http://www.khronos.org/registry/cl/api/1.1/cl.hpp]
 [11] Wikipedia on convolution:
[http://en.wikipedia.org/wiki/Convolution]
 [12] Code for this article:
[http://www.linux-magazine.com/Resources/Article-Code] (choose issue 127)
 [13] NVidia OpenCL programming guide:
[http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf]
 [14] OpenCL extension cl_khr_byte_addressable_store:
[http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_byte_addressable_store.html]
 [15] libpng website:
[http://www.libpng.org/pub/png/libpng.html]
 [16] OpenCL documentation:
[http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/]
 [17] OpenCL C++ bindings documentation:
[http://www.khronos.org/registry/cl/specs/opencl‑cplusplus‑1.1.pdf]

The Author

Markus Roth is a student of Computer Science at the Karlsruhe Institute of Technology (KIT), Germany, where he is researching GPU-supported acceleration in computer vision at the Institute of Anthropomatics.