OpenCL

The Framework for GPU Applications

When you create a regular C++ program for execution on the CPU, the data resides in RAM and in the CPU registers. If you want the GPU to process the data, you must first feed it to the GPU by copying it across the bus from RAM to video memory. The results are retrieved the same way, but in the opposite direction.

The video card runs a thread on each core. The threads all run the same kernel function (not to be confused with the convolution kernel described above), but with a different index for each thread. All threads that are active at the same time perform precisely the same instruction at a given point in time. The kernel is a piece of C source code that tells each thread what it has to do (depending on its index).

Depending on the scale of the problem, you will need to define the number of threads (in this case, the number of pixels in the output image) and the data (in this case, the input image, output image, convolution kernel) and then start the kernel.

Firing Up the GPU

Listing 3 shows the OpenCL code for the convolution example.

Listing 3: OpenCL Code
 001 #define __CL_ENABLE_EXCEPTIONS
 002 #include "convolve.hpp"
 003 #include "timer.hpp"
 004 #include <cstring> // strlen
 005 #include "CL/cl.hpp"
 006 #include <limits> // uchar max, min
 007 #include <assert.h>
 008 #include <iostream>
 009 
 010 /**
 011  * The OpenCL kernel for image convolution.
 012  */
 013 const char* kernelSource = "\
 014   __kernel void convolveKernel(\
 015     global uchar *in,\
 016     uint inWidth,\
 017     uint inHeight,\
 018     global uint *out,\
 019     uint outWidth,\
 020     uint outHeight,\
 021     global float *convKernel,\
 022     uint convKernelWidth,\
 023     uint convKernelHeight)\
 024 {\
 025   size_t x = get_global_id(0);\
 026   size_t y = get_global_id(1);\
 027   \
 028   /* Kill unneeded threads */\
 029   if(x >= outWidth || y >= outHeight)\
 030   {\
 031     return;\
 032   }\
 033   \
 034   float convolutionSum = 0.0f;\
 035   for(size_t ky = 0; ky < convKernelHeight; ++ky)\
 036   {\
 037     for(size_t kx = 0; kx < convKernelWidth; ++kx)\
 038         {\
 039           convolutionSum += (float) in[(y + ky) * inWidth + (x + kx)] * convKernel[ky * convKernelWidth + kx];\
 040         }\
 041   }\
 042   out[y * outWidth + x] = (uint) clamp(convolutionSum, 0.0f, 255.0f);\
 043 }";
 044 
 045 /**
 046  * Convolve a grayscale image with a convolution kernel on the GPU using OpenCL.
 047  */
 048 grayImage convolveGPU(grayImage in, convolutionKernel convKernel)
 049 {
 050   grayImage out;
 051   out.width = in.width - (convKernel.width - 1);
 052   out.height = in.height - (convKernel.height - 1);
 053   out.data = new uchar[out.height * out.width];
 054 
 055   // Platforms
 056   std::vector< cl::Platform > platforms;
 057   cl::Platform::get(&platforms);
 058   assert(platforms.size() > 0);
 059 
 060   // Devices
 061   std::vector<cl::Device> devices;
 062   platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
 063   assert(devices.size() > 0);
 064   assert(devices[0].getInfo<CL_DEVICE_TYPE>() == CL_DEVICE_TYPE_GPU);
 065 
 066   // Context
 067   cl::Context context(devices);
 068 
 069   // Create GPU buffers
 070   cl::Buffer inGPU(context, CL_MEM_READ_ONLY, in.width * in.height * sizeof(uchar));
 071   cl::Buffer convKernelGPU(context, CL_MEM_READ_ONLY, convKernel.width * convKernel.height * sizeof(float));
 072   cl::Buffer outGPU(context, CL_MEM_WRITE_ONLY, out.width * out.height * sizeof(uint));
 073 
 074   // Commandqueue
 075   cl::CommandQueue queue(context, devices[0], 0);
 076 
 077   // Upload in.data to inGPU
 078   queue.enqueueWriteBuffer(inGPU, false /* FIFO */, 0, in.width * in.height * sizeof(uchar), in.data);
 079 
 080   // Upload kernel.data to convKernelGPU
 081   queue.enqueueWriteBuffer(convKernelGPU, true /* Blocking for correct timing */, 0, convKernel.width * convKernel.height * sizeof(float), convKernel.data);
 082 
 083   // Program
 084   cl::Program::Sources source(1, std::make_pair(kernelSource, strlen(kernelSource)));
 085 
 086   cl::Program program(context, source);
 087   program.build(devices);
 088 
 089   // Ranges
 090   size_t localWidth = 16;
 091   size_t localHeight = 16;
 092 
 093   cl::NDRange localRange(localWidth, localHeight);
 094   cl::NDRange globalRange(((out.width - 1) / localWidth + 1) * localWidth, ((out.height - 1) / localHeight + 1) * localHeight);
 095 
 096   // Run kernel
 097   cl::Kernel kernel(program, "convolveKernel");
 098   cl::KernelFunctor func = kernel.bind(queue, globalRange, localRange);
 099 
 100   cl::Event event = func(inGPU, in.width, in.height, outGPU, out.width, out.height, convKernelGPU, convKernel.width, convKernel.height);
 101   event.wait();
 102 
 103   // Download result
 104   uint *outTemp = new uint[out.width * out.height];
 105   queue.enqueueReadBuffer(outGPU, true, 0, out.width * out.height * sizeof(uint), outTemp);
 106 
 107   // Convert uint array to uchar array
 108   for(size_t i = 0; i < out.width * out.height; ++i)
 109   {
 110     out.data[i] = (uchar) outTemp[i];
 111   }
 112 
 113   delete[] outTemp;
 114   return out;
 115 }

The OpenCL C++ bindings are included with #include <CL/cl.hpp>. To use the exceptions provided by the bindings rather than plain C error codes, I need to #define __CL_ENABLE_EXCEPTIONS. All of the classes reside in the cl:: namespace. To tell the linker what to link against, I add -lOpenCL to the g++ parameter list.
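Assuming the host code from Listing 3 lives in a file called convolve.cpp (the filename here is illustrative) and the OpenCL headers and library are installed in the default search paths, the complete build step looks like this:

```shell
# Link against the OpenCL runtime; the C++ bindings in CL/cl.hpp are header-only.
g++ -o convolve convolve.cpp -lOpenCL
```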

In contrast to CUDA, OpenCL doesn’t create platform-dependent code until run time. To do so, OpenCL first has to discover the hardware that will run the parallel code. For this purpose, I create a vector of cl::Platform and a vector of cl::Device (Listing 3, lines 56, 61).

Note that a cl::Platform comes in one of two profiles: “full profile” and “embedded profile.” In this article, I will be looking at programming with the full profile.

Within each cl::Platform, multiple cl::Devices can exist. A cl::Device represents a GPU or a CPU. The assert() calls in lines 63 and 64 make sure that at least one device supports OpenCL and that the first device is a supported video card. To execute the program on the CPU instead, you have to request devices of type CL_DEVICE_TYPE_CPU from the vector.

A cl::Context manages objects such as command queues, memory objects, kernels, and execution objects across multiple cl::Devices. I will look at these objects in more detail in the following sections. In this case, the cl::Context only manages the video card.

I still need to define a cl::CommandQueue, which stores action objects (commands); in the normal case, they are executed in first-in, first-out (FIFO) order.