Hello everyone,
I am currently trying to get familiar with JOCL and learn the basics.
To do that, I wrote a basic sample in which I fill an array representing an image with shades of blue,
so that every work-item has its own intensity value for the blue component.
Here's the example:
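__kernel void sampleKernel(__global float *intensitys, __global float *picture)
{
    // One work-item per image row.
    int gid = get_global_id(0);
    int width = 1800;
    int height = 1000;
    // Repeat the same work 2000 times, purely to lengthen the run time.
    for (int j = 0; j < 2000; j++) {
        // Start index of this work-item's row (rows are filled bottom-up).
        int position = (height - gid - 1) * width;
        for (int i = 0; i < width; i++) {
            picture[position + i] = 255 * intensitys[gid];
        }
    }
}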
I added the 2000-iteration loop only to increase the computation time so that I can benchmark it better. It has no influence on the final image.
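For comparison, here is a minimal plain-Java sketch of what the kernel computes, with one call to fillRow standing in for one work-item. The class and method names (BlueShadeReference, fillRow, render) and the test gradient in main are only illustrative and are not part of my JOCL code. It also shows why the extra loop does not change the image: every repetition writes the same values again.

```java
import java.util.Arrays;

// Hypothetical host-side reference for the kernel above (not the JOCL code).
class BlueShadeReference {
    static final int WIDTH = 1800;
    static final int HEIGHT = 1000;

    // Mirrors the kernel body for a single work-item: gid selects one row.
    static void fillRow(int gid, float[] intensitys, float[] picture, int repeats) {
        for (int j = 0; j < repeats; j++) {
            int position = (HEIGHT - gid - 1) * WIDTH;
            for (int i = 0; i < WIDTH; i++) {
                picture[position + i] = 255 * intensitys[gid];
            }
        }
    }

    // Runs every "work-item" sequentially, like a global_work_size of HEIGHT.
    static float[] render(float[] intensitys, int repeats) {
        float[] picture = new float[WIDTH * HEIGHT];
        for (int gid = 0; gid < HEIGHT; gid++) {
            fillRow(gid, intensitys, picture, repeats);
        }
        return picture;
    }

    public static void main(String[] args) {
        // Assumed test input: a linear gradient from 0 to 1 over the rows.
        float[] intensitys = new float[HEIGHT];
        for (int g = 0; g < HEIGHT; g++) {
            intensitys[g] = g / (float) (HEIGHT - 1);
        }
        float[] once = render(intensitys, 1);
        float[] twice = render(intensitys, 2);
        // Repeating the loop rewrites identical values, so the images match.
        System.out.println(Arrays.equals(once, twice)); // prints "true"
    }
}
```

The output of such a reference can also be compared against the buffer read back from the device to check that the kernel result is correct.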
My problem is that the execution time on the GPU is longer than on the CPU.
I use a global_work_size of 1000, one work-item for every line of the image:
local_work_size 64 for GPU, execution time: 540 ms
local_work_size 4 for CPU, execution time: 387 ms
I tried several local_work_size values, but the GPU was always slower.
I thought it could be the I/O between GPU and CPU, but removing the 2000-iteration loop results in nearly 0 ms computation time for both GPU and CPU.
Doubling the loop to 4000 iterations doubles the computation time, so the I/O has no big influence on the computation time.
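As a rough sanity check, the numbers above can be turned into an effective store rate. The sketch below assumes that every store in the kernel really reaches global memory (the compiler may well remove some of the redundant ones) and that each float is 4 bytes; the class and method names are only illustrative. Under those assumptions the kernel issues 3.6 billion stores, which works out to roughly 27 GB/s on the GPU and 37 GB/s on the CPU.

```java
// Back-of-the-envelope estimate; assumes no stores are optimized away.
class StoreThroughput {
    static final long WORK_ITEMS = 1000;    // global_work_size
    static final long WIDTH = 1800;         // inner loop iterations
    static final long REPEATS = 2000;       // outer benchmark loop
    static final long BYTES_PER_FLOAT = 4;

    static long totalStores() {
        return WORK_ITEMS * WIDTH * REPEATS;
    }

    static long totalBytes() {
        return totalStores() * BYTES_PER_FLOAT;
    }

    // Effective rate in GB/s (1 GB = 1e9 bytes) for a measured time in ms.
    static double gigabytesPerSecond(double milliseconds) {
        return (totalBytes() / 1e9) / (milliseconds / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println("stores: " + totalStores());
        System.out.println("GPU GB/s: " + gigabytesPerSecond(540));
        System.out.println("CPU GB/s: " + gigabytesPerSecond(387));
    }
}
```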
I really don't know why; the GPU with its 1000 shaders should perform much better than the CPU with its 4 cores.
I appreciate every hint. Thanks in advance for your help!
The code is in the appendix.