I had honestly forgotten about it, but a couple of weeks ago realhet pointed out to me that shuffle is there for GCN at the ISA level.
It rang a bell, and after a while I remembered a few pictures in GS-4106, "The AMD GCN Architecture - A Crash Course", by Layla Mah.
I see CL2 has broadcast via work_group_broadcast, and I can see why that is easier to specify than the rest, but...
Full 4-lane xbar? Yes.
I would be happy to see this as an extension, so as to bypass the lengthy scrutiny that goes into the core spec. Say, a CL_AMD_GCN_SIMD_SHUFFLE, guaranteed to work only with work group size 64. It would be enough for me. Is there any hope?
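
For anyone wondering how a shuffle differs from the broadcast, here is a minimal sketch in plain host-side C. It is not OpenCL C and it does not claim to match the actual GCN instruction semantics. The function names (emulate_broadcast, emulate_shuffle) and the WAVE_SIZE constant are made up for illustration, and I am assuming one 64-lane wavefront per work group.

```c
#include <assert.h>

#define WAVE_SIZE 64 /* assumption: one work group == one 64-lane wavefront */

/* Broadcast: every lane receives the value held by ONE chosen source lane. */
static void emulate_broadcast(const int *in, int *out, int src_lane)
{
    assert(src_lane >= 0 && src_lane < WAVE_SIZE);
    for (int lane = 0; lane < WAVE_SIZE; ++lane)
        out[lane] = in[src_lane];
}

/* Shuffle: each lane picks ITS OWN source lane, so any permutation
 * (or any gather pattern) across the wavefront can be expressed. */
static void emulate_shuffle(const int *in, int *out, const int *src_lane)
{
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        assert(src_lane[lane] >= 0 && src_lane[lane] < WAVE_SIZE);
        out[lane] = in[src_lane[lane]];
    }
}
```

A broadcast is just the special case of a shuffle where every lane asks for the same source index, which is why the general form is what I would like to have exposed.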