Hi,
I'm working with 96-bit unsigned integers represented as three 32-bit uints (d0 being the least significant limb), with several of them processed in parallel as a vector:
#define CON2(a,b) a##b
#define CONC(a,b) CON2(a,b)
#define VECTOR_SIZE 2
// #define VECTOR_SIZE 4
#define AS_UINT_V CONC(as_uint, VECTOR_SIZE)
typedef struct _int96_t
{
  CONC(uint, VECTOR_SIZE) d0,d1,d2; // e.g. uint2 d0,d1,d2;
} int96_t;
typedef CONC(uint, VECTOR_SIZE) uint_v; // e.g. uint2, used for temporaries below
Now I have a function that calculates the lower half of the product of two such 96-bit vectors:
void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b (low 96 bits) */
{
  __private uint_v tmp;

  res->d0 = a.d0 * b.d0;
  res->d1 = mul_hi(a.d0, b.d0);
  res->d2 = mul_hi(a.d1, b.d0);

  tmp = a.d1 * b.d0;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0); // carry out of d1

  res->d2 += mul_hi(a.d0, b.d1);
  tmp = a.d0 * b.d1;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0); // carry out of d1

  res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
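For reference, here is a scalar C model of the same routine (my own stand-in, not part of the kernel: `mul_hi32` mimics OpenCL's `mul_hi`, and the vector carry select collapses to a plain conditional). I use it to check the limb/carry logic against a wide native multiply:

```c
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96;

/* scalar stand-in for OpenCL's mul_hi: upper 32 bits of a 32x32 product */
static uint32_t mul_hi32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* low 96 bits of a * b, limb by limb, mirroring the kernel code above */
static void mul_96_ref(int96 *res, int96 a, int96 b) {
    uint32_t tmp;

    res->d0 = a.d0 * b.d0;
    res->d1 = mul_hi32(a.d0, b.d0);
    res->d2 = mul_hi32(a.d1, b.d0);

    tmp = a.d1 * b.d0;
    res->d1 += tmp;
    res->d2 += (tmp > res->d1) ? 1u : 0u;   /* carry out of d1 */

    res->d2 += mul_hi32(a.d0, b.d1);
    tmp = a.d0 * b.d1;
    res->d1 += tmp;
    res->d2 += (tmp > res->d1) ? 1u : 0u;   /* carry out of d1 */

    /* partial products that only affect d2 may wrap; we keep low 96 bits only */
    res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
```

The carry test works because after `d1 += tmp` an unsigned overflow occurred if and only if the new `d1` is smaller than `tmp`.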
In order to optimize performance, I tried using mad_hi instead of a mul_hi followed by an addition:
void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b (low 96 bits) */
{
  __private uint_v tmp;

  res->d0 = a.d0 * b.d0;
  tmp = a.d1 * b.d0;
  res->d1 = mad_hi(a.d0, b.d0, tmp);
  res->d2 = mad_hi(a.d1, b.d0, AS_UINT_V((tmp > res->d1) ? 1 : 0)); // carry out of d1
  res->d2 = mad_hi(a.d0, b.d1, res->d2);

  tmp = a.d0 * b.d1;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0); // carry out of d1

  res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
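Again as a sanity check, a scalar C model of the mad_hi formulation (my own names: `mad_hi32` mimics OpenCL's `mad_hi` as hi(a*b)+c); it should produce the same 96-bit result as the mul_hi-plus-addition version, since folding the addition into mad_hi does not change any of the partial sums or carries:

```c
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96m;

/* scalar stand-ins for OpenCL's mul_hi / mad_hi */
static uint32_t mh32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
static uint32_t mad_hi32(uint32_t a, uint32_t b, uint32_t c) {
    return mh32(a, b) + c;   /* hi(a*b) + c */
}

/* low 96 bits of a * b, using the fused mad_hi formulation above */
static void mul_96_mad_ref(int96m *res, int96m a, int96m b) {
    uint32_t tmp;

    res->d0 = a.d0 * b.d0;
    tmp = a.d1 * b.d0;
    res->d1 = mad_hi32(a.d0, b.d0, tmp);
    res->d2 = mad_hi32(a.d1, b.d0, (tmp > res->d1) ? 1u : 0u); /* carry out of d1 */
    res->d2 = mad_hi32(a.d0, b.d1, res->d2);

    tmp = a.d0 * b.d1;
    res->d1 += tmp;
    res->d2 += (tmp > res->d1) ? 1u : 0u;                      /* carry out of d1 */

    res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
```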
However, this second function is between 2% and 15% slower in various kernels, probably depending on the code surrounding the call. This is with Catalyst 12.6 on an HD 5770, on both Win64 and Linux64. I just updated to 12.8, but that makes no difference.
I already found out that mad_hi will be translated (for example) to
33  t: MULHI_UINT  ____, R10.x, R2.w
34  x: ADD_INT     ____, T0.z, PS33
But then it should behave the same as writing mul_hi plus an addition myself, so why is it so much slower? In other places that are executed just once per kernel, I noticed big differences as well, sometimes making the code a bit faster, sometimes much slower than mul_hi plus addition.
Under which conditions would it use native mad_hi instructions?
Also, I have rather bad ALU packing (~75%), caused by lots of MULLO_INT and MULHI_UINT instructions that only run in the t-unit, leaving x through w empty. Can anyone suggest how to improve that in general?
Thanks a lot,
Bdot