Channel: Community : All Content - OpenCL

low VALU utilization without memory accesses


good morning.

 

thanks for listening. here's the riddle:

 

i have written an opencl kernel which is 99% alu bound, using only bit operation instructions (v_xor, v_not, v_or, v_bfi). there is no access to memory (including lds). it uses only v_ instructions, with the exception of s_cmp, s_add and s_branch for the for loop. there are 100 alu instructions in the loop body, but the measurements (valu utilization, aka VALUBusy) do not change if there are 1000.

the loop is executed 10000 times per kernel invocation.

there are minimal read-after-write conflicts, and all read-after-write accesses to registers happen in non-adjacent instructions.

the loop body is a mix of ~70 VOP3 (v_bfi) instructions and ~30 VOP2 (and, or, xor).

the whole kernel program size is a mere 3k (fits instruction cache).

there is no thread divergence.

i cannot determine whether the 8 wavefronts per CU are executed concurrently or sequentially (4 then 4), because i don't see how i can access s_memtime from opencl.

 

yet: sprofile measures a VALUBusy of only 60%. doing the math myself (number of instructions vs. kernel time) i come to the same conclusion.
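for reference, the kind of back-of-the-envelope math meant above can be sketched like this. it is only a rough estimate under stated assumptions: a 64-wide wavefront occupies a 16-lane SIMD for 4 cycles per valu instruction, the 8 wavefronts per CU are spread evenly over 4 SIMDs (2 per SIMD), and every wavefront runs exactly once. the function name, the 1 GHz clock and the 12.5 ms kernel time are hypothetical placeholders, not measured values.

```python
# Rough VALUBusy estimate for a single SIMD (a sketch, not a profiler model).
# Assumption: one VALU instruction of a 64-wide wavefront takes 4 cycles
# on a 16-lane SIMD.
CYCLES_PER_VALU_INSTR = 4


def estimate_valu_busy(valu_per_iter, iterations, waves_per_simd,
                       kernel_time_s, clock_hz):
    """Return the fraction of SIMD cycles spent issuing VALU instructions."""
    # Cycles the SIMD spends on VALU work for all wavefronts assigned to it.
    busy_cycles = (valu_per_iter * iterations * waves_per_simd
                   * CYCLES_PER_VALU_INSTR)
    # Cycles available to the SIMD during the whole kernel run.
    total_cycles = kernel_time_s * clock_hz
    return busy_cycles / total_cycles


if __name__ == "__main__":
    # From the post: 100 VALU instructions per iteration, 10000 iterations,
    # 8 wavefronts per CU, assumed here to mean 2 per SIMD.
    # Hypothetical: 1 GHz engine clock and a 12.5 ms kernel time.
    busy = estimate_valu_busy(100, 10000, 2, 12.5e-3, 1.0e9)
    print(f"estimated VALUBusy: {busy:.0%}")
```

plugging in the real engine clock and the measured kernel time gives a number that can be compared directly against the profiler's VALUBusy counter.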

 

i am curious: what does a kernel with more than 90% VALUBusy look like? any examples?

