Hello All,
Disclaimer: I am new to GPU programming (started on Tuesday), but not to parallel or SIMD programming (Connection Machine, 16K 1-bit processors running in SIMD fashion... I am dating myself here ;-) ).
Also, I did read the 2620_final white paper and the Southern Islands ISA document, plus any other papers/presentations I could get my hands on...
I have some questions regarding the way instructions are scheduled on a GCN CU.
What I have understood so far:
A CU has 4 vector units, 1 scalar unit and 1 LDS unit, and issues one instruction per cycle to each, but:
vector unit i consumes the same instruction from wavefront wf-ai 4 times (4 cycles), for i in {0,1,2,3};
the scalar unit gets one instruction per cycle from 4 different wavefronts wf-b0, wf-b1, wf-b2 and wf-b3. So, seen from one vector unit, one scalar instruction can be executed per 4-cycle group if it belongs to a wavefront not currently issuing on that vector unit.
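To make sure I am describing that mental model clearly, here is a toy sketch (in Python, just for discussion) of how I currently picture one 4-cycle issue group. This is purely my own assumption, not something taken from the docs, and the wavefront names wf-a*/wf-b* are just labels:

# Toy model of how I picture instruction issue on one CU over a 4-cycle group.
# Pure assumption on my part -- please correct me if this is wrong.

SIMD_COUNT = 4   # vector units per CU
LANES = 16       # lanes per vector unit, so a 64-wide wavefront needs 4 cycles

def four_cycle_group():
    events = []
    for cycle in range(4):
        for simd in range(SIMD_COUNT):
            # Each vector unit repeats the same instruction for 4 cycles,
            # walking through wavefront wf-a<simd> 16 work-items at a time.
            first = cycle * LANES
            events.append(f"cycle {cycle}: SIMD{simd} v_instr of wf-a{simd} "
                          f"(work-items {first}-{first + LANES - 1})")
        # The scalar unit issues one instruction per cycle, each one taken
        # from a different wavefront (wf-b0..wf-b3) that is not issuing a
        # vector instruction in this group.
        events.append(f"cycle {cycle}: SALU s_instr of wf-b{cycle}")
    return events

for e in four_cycle_group():
    print(e)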
Question 1: Could a wavefront with many successive scalar instructions (and no other wavefront in the CU in a position to execute a scalar instruction)
run more than one scalar instruction per 4-cycle group? (I would guess not, if the scalar unit has the same need to hide pipeline latency as the vector units.)
Question 2: Could an LDS instruction and a scalar instruction belonging to the same wavefront run in the same 4-cycle group?
Same question for an LDS instruction and a vector instruction of the same wavefront.
Question 3: Wavefront priority: does the priority affect LDS and scalar instruction dispatch over the whole CU, or just on a per-vector-unit basis?
Question 4: On a vector unit with two wavefronts executing on it (same priority), can it be assumed that they will alternate every 4 cycles (no GDS or LDS execution pending, and no conflict)?
Question 5: 64-bit instructions on the vector unit take 4 cycles (16 cycles for a wavefront). I understand this as a pure stall for that vector unit, but could 4 or more scalar and LDS instructions belonging to other wavefronts running on the same vector unit be executed during those cycles?
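To be explicit about how I am counting those 16 cycles (again, just my assumption):

# My cycle counting for a 64-bit vector op (assumption on my part):
lanes_per_simd = 16
wavefront_size = 64
cycles_per_quarter_wave = 4                        # for a 64-bit op, vs. 1 for 32-bit
quarter_waves = wavefront_size // lanes_per_simd   # = 4
print(quarter_waves * cycles_per_quarter_wave)     # = 16 cycles that vector unit is busy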
Voila, I guess that will be it for now...
Thanks,
Eric L.
PS: The targeted app is of a symbolic nature (marginal use of floating point), and it is latency sensitive.