I have found a subtle bug in the latest Catalyst drivers: OpenCL kernels that read __local memory through a constant index can produce wrong results.
Conditions: Within each work-group, the bug appears only in the work-items where get_local_id(0) == 0 and get_local_id(1) is 0 or 14, with a (16, 16, 1) work-group size.
Example:
__local float s_srcPatch[10][10];
__local float s_dstPatch[20][16];
float sum;
sum = (evenFlag * 0.0625f) * s_srcPatch[0][1 + ((tidx - 2) / 2)];
sum = sum + ( oddFlag * 0.25f ) * s_srcPatch[0][1 + ((tidx - 1) / 2)];
sum = sum + (evenFlag * 0.375f ) * s_srcPatch[0][1 + ((tidx ) / 2)];
sum = sum + ( oddFlag * 0.25f ) * s_srcPatch[0][1 + ((tidx + 1) / 2)];
sum = sum + (evenFlag * 0.0625f) * s_srcPatch[0][1 + ((tidx + 2) / 2)]; // Operation 5
s_dstPatch = sum;
In the case above, Operation 5 is ignored, so the final sum is incorrect. The following variant replaces the constant row index 0 with a runtime expression:
__local float s_srcPatch[10][10];
__local float s_dstPatch[20][16];
float sum;
sum = (evenFlag * 0.0625f) * s_srcPatch[get_local_size(1) - 16][1 + ((tidx - 2) / 2)];
sum = sum + ( oddFlag * 0.25f ) * s_srcPatch[get_local_size(1) - 16][1 + ((tidx - 1) / 2)];
sum = sum + (evenFlag * 0.375f ) * s_srcPatch[get_local_size(1) - 16][1 + ((tidx ) / 2)];
sum = sum + ( oddFlag * 0.25f ) * s_srcPatch[get_local_size(1) - 16][1 + ((tidx + 1) / 2)];
sum = sum + (evenFlag * 0.0625f) * s_srcPatch[get_local_size(1) - 16][1 + ((tidx + 2) / 2)]; // get_local_size(1) == 16, so the row index is 0, the same as in the code segment above
s_dstPatch = sum;
If I use the expression "get_local_size(1) - 16" instead of the constant 0, everything works and I get the correct sum.
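To see which work-items return a wrong sum, the kernel output can be compared against a host-side reference. Below is a minimal sketch in plain C of the same five-tap sum over row 0 of s_srcPatch. The post does not show how evenFlag, oddFlag, or tidx are defined, so the sketch assumes tidx is a signed int in the range 0..15, evenFlag is 1.0f for even tidx and 0.0f otherwise, and oddFlag is its complement. The function name reference_sum is hypothetical.

```c
#include <assert.h>

#define SRC_COLS 10 /* matches __local float s_srcPatch[10][10] */

/* Host-side reference for the five-tap sum over one row of s_srcPatch.
 * Assumptions (not stated in the original post):
 *   - tidx is a signed int in [0, 15], e.g. get_local_id(0)
 *   - evenFlag is 1.0f when tidx is even, 0.0f otherwise
 *   - oddFlag is the complement of evenFlag
 */
static float reference_sum(const float row[SRC_COLS], int tidx)
{
    /* Keeps every column index below within [0, 9]. */
    assert(tidx >= 0 && tidx < 16);

    float evenFlag = (tidx % 2 == 0) ? 1.0f : 0.0f; /* assumed definition */
    float oddFlag  = 1.0f - evenFlag;               /* assumed definition */
    float sum;

    sum =       (evenFlag * 0.0625f) * row[1 + ((tidx - 2) / 2)];
    sum = sum + ( oddFlag * 0.25f  ) * row[1 + ((tidx - 1) / 2)];
    sum = sum + (evenFlag * 0.375f ) * row[1 + ((tidx    ) / 2)];
    sum = sum + ( oddFlag * 0.25f  ) * row[1 + ((tidx + 1) / 2)];
    sum = sum + (evenFlag * 0.0625f) * row[1 + ((tidx + 2) / 2)]; /* Operation 5 */
    return sum;
}
```

Under these assumptions, if row 0 is filled with row[i] = i, the reference gives 0.5 for tidx == 0. A kernel that drops Operation 5 would give 0.375 for the same input, so the missing term is easy to spot when comparing the two.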