Copying Data

To be able to store data (input image, convolution kernel, output image) in video memory, I first need to create cl::Buffer objects for the input image, output image, and convolution kernel (lines 70-72). 

Next, I pass in the managing context, access mode, and size to the managing context. This is much like a malloc() in the GPU RAM. For a cl::Buffer, the access modes are CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, and CL_MEM_READ_WRITE. This refers to access by the video card. The host always has full access to the memory objects. For each memory object, I need to give the size in bytes to allocate in typical malloc() style. If the GPU RAM is not big enough, an exception is thrown.

The next step is to copy the input data from the host to the cl::Buffer. To copy the input data, I line up copy actions in the command queue using enqueueWriteBuffer() (line 81).

The parameters that I need to pass in to the command are the target buffer, whether or not this is a blocking action (in other words, execution doesn’t return to the host until the action has been completed), an offset, the size to copy in bytes, and a pointer to the host memory address.

The blocking parameter for the first copy action is set to false. The action is lined up in the command queue, and the host regains control of the execution flow (while the copy action runs in the background) and can thus set up the second copy action. This second action is blocking to prevent the host from carrying on work until the second copy action has been completed.

Because the command queue uses the FIFO approach, both copy actions are completed after this, which allows for correct timing of I/​O actions. If no dependencies exist, or only linear dependencies between the actions, you can use non-blocking actions for everything, apart from the last action that you put in the queue.

The Kernel

I need to tell each thread on the video card what to do, depending on its index value. The kernel source code is defined as a const char* and named kernelSource (line 13). I use the __kernel keyword to tell the OpenCL run-time compiler that this is an OpenCL kernel.

As a subset of ISO C99, OpenCL C also features syntax and semantics similar to the standard. In addition to this, OpenCL includes a number of useful, built-in functions [13].

Three parameters in the parameter list are tagged with the global keyword. These parameters identify pointers in the global video memory; the pointers reference the cl::Buffers. Using the get_global_id() built-in function, a unique index is assigned to each thread.

The parameter 0 is for the x dimension along the output image (line 25), and 1 stands for the y dimension. The if instruction (line 29) terminates any threads that might have been started unnecessarily.

The main task for each thread is basically taken directly from the CPU implementation or from the pseudocode intermediate step (Listing 2). 

I only need to port the two inner loops (Listing 2, lines 7 and 9) because the threads process all the required x and y values in parallel. I use the built-in clamp() function (Listing 3, line 42) to store the convolution sum; its function is similar to clampuchar().

Attentive readers will have noticed that the cl::Buffer out stores unsigned integers rather than unsigned chars (Listing 3, line 72). The reason for this is that some video card models don’t support storing arbitrary addresses [14]. If you use addresses that are integer-aligned (every 4 bytes), you are definitely on the safe side. 

If you want to use arbitrary addresses, you must enable the OpenCL cl_khr_byte_addressable_store pragma. You can issue a getInfo() for the cl::Device to determine whether your GPU supports this feature.