Silent __private memory size limit?

March 23, 2014, 10:11 am

Hi,

Our application launches a few OpenCL kernels in a loop, with each iteration waiting for the previous one to complete (clFinish). One of the kernels is quite complex and uses nearly 18 kB of private memory per work item. We had a very hard time making it work on the AMD platform (no significant problems with NVIDIA or Intel). The application ran OK for a few iterations of the loop, and then enqueuing the complex kernel suddenly started returning an "out of resources" error. Compilation and the first enqueue calls were all OK. Finally we tried replacing the __private memory buffers with pieces of a __global buffer for each work item (reducing __private usage to about 3 kB per work item), and it started working even on AMD.
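
The workaround described above, giving each work item its own piece of a __global buffer instead of large __private arrays, could be sized and indexed roughly as follows. This is only a sketch under assumptions: the helper names `scratch_total_bytes` and `scratch_offset_bytes` are hypothetical, and it assumes one contiguous slice per work item, indexed by the one-dimensional global ID.

```c
#include <stddef.h>

/* Total bytes of __global scratch needed when every work item owns one
 * contiguous slice. Returns 0 if the product would overflow size_t.
 * (Hypothetical helper, not part of any OpenCL API.) */
size_t scratch_total_bytes(size_t global_work_size, size_t bytes_per_item)
{
    if (bytes_per_item != 0 && global_work_size > (size_t)-1 / bytes_per_item)
        return 0;
    return global_work_size * bytes_per_item;
}

/* Byte offset of the slice owned by work item `gid`. Inside a kernel the
 * same arithmetic would use get_global_id(0) as `gid`. */
size_t scratch_offset_bytes(size_t gid, size_t bytes_per_item)
{
    return gid * bytes_per_item;
}
```

As an illustration only, moving about 15 kB per work item out of private memory for 1024 work items would need a 15,728,640-byte buffer; both figures are made up for the example.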

My question: Is there any private memory size limit? I'd like to know whether we have fixed the issue in our code (by reducing private memory usage) or only masked one side effect of a bug that is still there.

All of this was happening on Ubuntu Linux 12.04 with the following driver:

[	6.750882] <6>[fglrx] module loaded - fglrx 13.35.5 [Mar 12 2014] with 1 minors

When we tried with Windows 7, the graphics driver always crashed.

Thanks,

Martin Jirman

↧

Is it possible to compile OpenCL code for all current video cards?

March 24, 2014, 1:38 am

I need to distribute my program, which uses OpenCL code, without the OpenCL sources. So I need to compile it to a binary and load the program from that binary. This is simple, but each video card compiles a different binary for itself. How can I avoid this and compile my source code for different video cards without having those cards? Maybe a compiler exists where I can set a specific video card and get binary code for it? Thanks!

↧

OpenCL Image2D Format

March 23, 2014, 6:35 pm

Hello Guys,

I am starting to use OpenCL for image processing, so I am learning the basics of image objects and how to upload image data to an OpenCL device.

I read the specification, even bought a book (OpenCL in Action), and I have been doing fine so far.

But I am facing a problem (I think it's a problem) with the image data format.

To keep my project reasonably portable I am using ANSI C with OpenCL (for processing) and SDL for the GUI and the image loading functions.

SDL's image loading functions give me a byte array of pixel data. The format is the same as a simple BMP: 3 bytes per pixel (1 byte each for blue, green and red).

To create an image2d (clCreateImage2D) I have the following function declaration:

cl_mem clCreateImage2D(cl_context context,
                       cl_mem_flags flags,
                       const cl_image_format *image_format,
                       size_t image_width,
                       size_t image_height,
                       size_t image_row_pitch,
                       void *host_ptr,
                       cl_int *errcode_ret)

The image_format argument (cl_image_format) intrigues me. The docs say I can use a "CL_BGRA" format, so I must add an extra byte for each pixel in my buffer, right? (Today my buffer is [b,g,r][b,g,r]... and I would add an extra byte so it becomes [b,g,r,1][b,g,r,1].) Is that the right approach?
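
The padding step described above could look roughly like the sketch below. It is an illustration under assumptions, not a confirmed answer: the function name `bgr_to_bgra` is hypothetical, the source rows are assumed to be tightly packed (no row padding), and the alpha byte is set to 255, which reads as 1.0 only with a normalized channel type such as CL_UNORM_INT8.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed 3-byte BGR pixels into 4-byte BGRA pixels.
 * src holds pixel_count * 3 bytes; dst must have room for pixel_count * 4.
 * Alpha is set to 255 (opaque); that choice is an assumption. */
void bgr_to_bgra(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        dst[4 * i + 0] = src[3 * i + 0]; /* blue  */
        dst[4 * i + 1] = src[3 * i + 1]; /* green */
        dst[4 * i + 2] = src[3 * i + 2]; /* red   */
        dst[4 * i + 3] = 255;            /* alpha */
    }
}
```

Whether a given channel order and channel type combination is accepted depends on the device; clGetSupportedImageFormats reports what a context supports.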

What really intrigues me is the second field of cl_image_format (cl_channel_type image_channel_data_type). The cl_channel_type enumeration says I can use CL_UNSIGNED_INT8 as the data type, and that is confusing me.

The docs say (about CL_UNSIGNED_INT8): Each channel component is an unnormalized unsigned 8-bit integer value.

In my host program, my byte array is made of "unsigned int" (4 bytes each), so if I pass it to clCreateImage2D using CL_UNSIGNED_INT8 as the cl_channel_type in my cl_image_format parameter, will it work? Will OpenCL convert my 4byte info to a 8byte info?

Or should I convert my byte buffer from integer to long/double values?

What am I missing? I think it may be simpler than this, but I am missing something I can't see. Could someone give me a hand?

↧

Calculation error on GPU only

March 24, 2014, 1:15 pm

Hi,

As updates to older threads don't seem to receive much attention, I'm creating this new one. Please head over to the original thread and help me find a workaround or solution to this "behavior", which I'd call an AMD-OCL-compiler bug (until proven otherwise ;-).

Thanks

↧

Kernel Compilation "LLVM ERROR"

March 17, 2014, 10:08 am

Here is the OpenCL (I've marked the statements that seem to cause the issue - lines 8 and 21):

(If I were to change tempint on those lines to any literal uint the kernel compiles fine - madness)

uint wide_add_vector(uint* res, const uint* a, const uint* b)
{
  ulong carry=0;
  #pragma unroll
  for(uint i=0;i<4;i++){
    ulong tmp=(ulong)(a[i])+b[i]+carry;
    uint tempint = (uint)(tmp&0xFFFFFFFF);
    res[i] = tempint; // <---- Problem statement
    carry=tmp>>32;
  }
  return carry;
}

uint wide_add_scalar(uint* res, const uint* a, uint b)
{
  ulong carry=b;
  #pragma unroll
  for(uint i=0;i<4;i++){
    ulong tmp=a[i]+carry;
    uint tempint = (uint)(tmp&0xFFFFFFFF);
    res[i] = tempint; // <---- Problem statement
    carry=tmp>>32;
  }
  return carry;
}

void wide_mul(uint* res_hi, uint* res_lo, const uint* a, const uint* b)
{
  ulong carry=0, acc=0;
  #pragma unroll
  for(uint i=0; i<4; i++){
    #pragma unroll
    for(uint j=0; j<=i; j++){
      ulong tmp=(ulong)(a[j])*b[i-j];
      acc+=tmp;
      carry+=(acc < tmp);
    }
    res_lo[i]=(uint)(acc&0xFFFFFFFF);
    acc= (carry<<32) | (acc>>32);
    carry=carry>>32;
  }
  #pragma unroll
  for(uint i=1; i<4; i++){
    #pragma unroll
    for(uint j=i; j<4; j++){
      ulong tmp=(ulong)(a[j])*b[4-j+i-1];
      acc+=tmp;
      carry+=(acc < tmp);
    }
    res_hi[i-1]=(uint)(acc&0xFFFFFFFF);
    acc= (carry<<32) | (acc>>32);
    carry=carry>>32;
  }
  res_hi[3]=acc;
}

void wide_copy_global(__global uint *res, const uint *a)
{
  #pragma unroll
  for(uint i=0;i<8;i++){
    res[i]=a[i];
  }
}

__kernel void bitecoin_miner(ulong roundId,ulong roundSalt,ulong chainHash, uint4 c, uint hashSteps, __global uint* proofBuffer)
{
    uint workerID = get_global_id(0);

    uint cArray[4] = {c.x,c.y,c.z,c.w};

    uint x[8] = {workerID,0,(uint)roundId,(uint)roundId,(uint)roundSalt,(uint)roundSalt,(uint)chainHash,(uint)chainHash};

    for(uint j=0;j<hashSteps;j++)
    {
        uint tmp[8];

        wide_mul(tmp+4, tmp, x, cArray); // cArray; not to be confused with carry.

        uint carry=wide_add_vector(x, tmp, x+4);

        wide_add_scalar(x+4, tmp+4, carry);
    }

    wide_copy_global(proofBuffer+8*workerID,x);
}
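
When trying workarounds for a compiler failure like this, a host-side reference for the limb arithmetic may help confirm that a rewritten kernel still produces the same values. The sketch below mirrors wide_add_vector in plain C; the name `wide_add_vector_ref` is hypothetical, and it assumes the limbs are stored least significant first, as the carry direction in the kernel code suggests.

```c
#include <stdint.h>

/* Host-side reference for the kernel's wide_add_vector: adds two 128-bit
 * numbers stored as four 32-bit limbs (least significant limb first) and
 * returns the carry out of the top limb, which is 0 or 1. */
uint32_t wide_add_vector_ref(uint32_t *res, const uint32_t *a, const uint32_t *b)
{
    uint64_t carry = 0;
    for (unsigned i = 0; i < 4; ++i) {
        uint64_t tmp = (uint64_t)a[i] + b[i] + carry;
        res[i] = (uint32_t)(tmp & 0xFFFFFFFFu); /* low 32 bits of the sum */
        carry = tmp >> 32;                      /* carry into next limb   */
    }
    return (uint32_t)carry;
}
```

Comparing a few inputs, including one that carries through every limb, against the device output is usually enough to catch a workaround that changes behavior.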

When run I get:

LogLevel = 2 -> 2
[MyClient], 1395075385.62, 2, Created log.
Will try to connect to address Minty at port 4000
Found 1 platforms
  Platform 0 : Advanced Micro Devices, Inc.
Choosing platform 0
Found 2 devices
  Device 0 : Tahiti
  Device 1 : Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Choosing device 0
LLVM ERROR: Cannot select: 0x855acbc3a0: i32 = setcc 0x855acbcca0, 0x855ac3a080, 0x855ac3a480 [ORD=52] [ID=30]
  0x855acbcca0: i64 = add 0x855ac3a080, 0x855ac3aa80 [ORD=49] [ID=28]
    0x855ac3a080: i64,ch = CopyFromReg 0x855ac2b1d0, 0x855ac3a680 [ORD=49] [ID=19]
      0x855ac3a680: i64 = Register %vreg33 [ORD=49] [ID=7]
    0x855ac3aa80: i64 = mul 0x855acbcda0, 0x855ac37450 [ORD=48] [ID=27]
      0x855acbcda0: i64,ch = load 0x855ac2b1d0, 0x855ac37250, 0x855ac3a380<LD4[%scevgep106], zext from i32> [ORD=47] [ID=26]
        0x855ac37250: i32 = add 0x855ac36640, 0x855ac38960 [ORD=45] [ID=25]
          0x855ac36640: i32 = sub 0x855ac37850, 0x855ac37050 [ORD=44] [ID=24]
            0x855ac37850: i32 = FrameIndex<0> [ORD=41] [ID=1]
            0x855ac37050: i32 = shl 0x855acbbc90, 0x855ac3a980 [ORD=44] [ID=23]
              0x855acbbc90: i32,ch = CopyFromReg 0x855ac2b1d0, 0x855ac36940 [ORD=43] [ID=18]
                0x855ac36940: i32 = Register %vreg30 [ORD=43] [ID=3]
              0x855ac3a980: i32 = Constant<2> [ORD=44] [ID=4]
          0x855ac38960: i32 = Constant<8> [ORD=45] [ID=5]
        0x855ac3a380: i32 = undef [ORD=46] [ID=6]
      0x855ac37450: i64 = zero_extend 0x855acbbd90 [ORD=42] [ID=21]
        0x855acbbd90: i32,ch = CopyFromReg 0x855ac2b1d0, 0x855acbba90 [ORD=42] [ID=17]
          0x855acbba90: i32 = Register %vreg31 [ORD=42] [ID=2]
  0x855ac3a080: i64,ch = CopyFromReg 0x855ac2b1d0, 0x855ac3a680 [ORD=49] [ID=19]
    0x855ac3a680: i64 = Register %vreg33 [ORD=49] [ID=7]
In function: __OpenCL_bitecoin_miner_kernel
Press any key to continue . . .

If I put it into Kernel Analyzer it just freezes.

Any ideas?

The system is:

Windows 8.1 64-bit, Visual Studio 2013

HD7970 Driver Version 13.350.1005.0

Catalyst 14.2

AMD APP SDK 2.9

Many Thanks

Henry

↧

OpenCL FFT implementation

March 25, 2014, 2:09 am

Hi all,

I'm trying to understand the following article related to the subject:

http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-fast-fourier-transform-part-ii/

namely the FFT_64 kernel. The author says:

"The above shown listing begins with a function map_id() that computes the relative memory offsets within each workgroup."

But what exactly should this function look like? I guess it should map instances to avoid bank conflicts, but I have no idea how exactly it should be implemented. Can someone help me with that?

Thanks,