Hi,
I'm working with 96-bit unsigned integers represented as three 32-bit uints (d0 being the least significant limb), with several of them processed in parallel as a vector:
#define CON2(a,b) a##b
#define CONC(a,b) CON2(a,b)
#define VECTOR_SIZE 2
// #define VECTOR_SIZE 4
#define AS_UINT_V CONC(as_uint, VECTOR_SIZE)
typedef struct _int96_t
{
  CONC(uint, VECTOR_SIZE) d0,d1,d2; // e.g. uint2 d0,d1,d2;
} int96_t;
typedef CONC(uint, VECTOR_SIZE) uint_v; // e.g. uint2, used for temporaries below
Now I have a function that calculates the lower half of the product of two such 96-bit vectors:
void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b (low 96 bits) */
{
  __private uint_v tmp;

  res->d0 = a.d0 * b.d0;
  res->d1 = mul_hi(a.d0, b.d0);
  res->d2 = mul_hi(a.d1, b.d0);

  tmp = a.d1 * b.d0;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0); // carry out of d1

  res->d2 += mul_hi(a.d0, b.d1);
  tmp = a.d0 * b.d1;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0); // carry out of d1

  res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
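For reference, here is a scalar C model of the same routine (my own stand-in, not part of the kernel: `mul_hi32` mimics OpenCL's `mul_hi`, and the vector carry select collapses to a plain conditional). I use it to check the limb/carry logic against a wide native multiply:

```c
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96;

/* scalar stand-in for OpenCL's mul_hi: upper 32 bits of a 32x32 product */
static uint32_t mul_hi32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* low 96 bits of a * b, limb by limb, mirroring the kernel code above */
static void mul_96_ref(int96 *res, int96 a, int96 b) {
    uint32_t tmp;

    res->d0 = a.d0 * b.d0;
    res->d1 = mul_hi32(a.d0, b.d0);
    res->d2 = mul_hi32(a.d1, b.d0);

    tmp = a.d1 * b.d0;
    res->d1 += tmp;
    res->d2 += (tmp > res->d1) ? 1u : 0u;   /* carry out of d1 */

    res->d2 += mul_hi32(a.d0, b.d1);
    tmp = a.d0 * b.d1;
    res->d1 += tmp;
    res->d2 += (tmp > res->d1) ? 1u : 0u;   /* carry out of d1 */

    /* partial products that only affect d2 may wrap; we keep low 96 bits only */
    res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
```

The carry test works because after `d1 += tmp` an unsigned overflow occurred if and only if the new `d1` is smaller than `tmp`.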
In order to optimize performance, I tried using mad_hi instead of a mul_hi followed by an addition:
void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b (low 96 bits) */
{
  __private uint_v tmp;

  res->d0 = a.d0 * b.d0;
  tmp = a.d1 * b.d0;
  res->d1 = mad_hi(a.d0, b.d0, tmp);
  res->d2 = mad_hi(a.d1, b.d0, AS_UINT_V((tmp > res->d1) ? 1 : 0)); // carry out of d1
  res->d2 = mad_hi(a.d0, b.d1, res->d2);

  tmp = a.d0 * b.d1;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0); // carry out of d1

  res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
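Again as a sanity check, a scalar C model of the mad_hi formulation (my own names: `mad_hi32` mimics OpenCL's `mad_hi` as hi(a*b)+c); it should produce the same 96-bit result as the mul_hi-plus-addition version, since folding the addition into mad_hi does not change any of the partial sums or carries:

```c
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96m;

/* scalar stand-ins for OpenCL's mul_hi / mad_hi */
static uint32_t mh32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
static uint32_t mad_hi32(uint32_t a, uint32_t b, uint32_t c) {
    return mh32(a, b) + c;   /* hi(a*b) + c */
}

/* low 96 bits of a * b, using the fused mad_hi formulation above */
static void mul_96_mad_ref(int96m *res, int96m a, int96m b) {
    uint32_t tmp;

    res->d0 = a.d0 * b.d0;
    tmp = a.d1 * b.d0;
    res->d1 = mad_hi32(a.d0, b.d0, tmp);
    res->d2 = mad_hi32(a.d1, b.d0, (tmp > res->d1) ? 1u : 0u); /* carry out of d1 */
    res->d2 = mad_hi32(a.d0, b.d1, res->d2);

    tmp = a.d0 * b.d1;
    res->d1 += tmp;
    res->d2 += (tmp > res->d1) ? 1u : 0u;                      /* carry out of d1 */

    res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
```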
However, this second function is between 2% and 15% slower in various kernels, probably depending on the code surrounding the call. This is with Catalyst 12.6 on an HD 5770, on both Win64 and Linux64. I just updated to 12.8, but that makes no difference.
I already found out that mad_hi will be translated (for example) to
33  t: MULHI_UINT  ____, R10.x, R2.w
34  x: ADD_INT     ____, T0.z, PS33
But then it should behave the same as writing mul_hi plus an addition myself, so why is it so much slower? In other places that are executed just once per kernel, I noticed big differences as well, sometimes making the code a bit faster, sometimes much slower than mul_hi plus addition.
Under which conditions would it use native mad_hi instructions?
Also, I have rather bad ALU packing (~75%), caused by lots of MULLO_INT and MULHI_UINT instructions that only run in the t-unit, leaving x through w empty. Can anyone suggest how to improve that in general?
Thanks a lot,
Bdot