Channel: Community : All Content - OpenCL

x264 demo not working on AMD platform


Hi,

 

I downloaded an OpenCL-enabled x264 encoder from Universität Heidelberg. It compiled fine on the AMD platform, but it fails to run on Ubuntu 13.10 with an HD 7970: segmentation fault.

Sharing the compiled binaries here:

 

Command to run the demo: # ./x264 --threads 8 -A none --no-cabac --no-deblock --subme 0 --me dia --qp 16 --output out.264 result.y4m

 

Change the x264 executable's permissions if required.

 

Can someone help me with this?


OpenCL Samples Visual Studio Compile Errors


I just downloaded the AMD OpenCL SDK and tried compiling one of the samples in Visual Studio 2010. These are the errors I get:

Error 1 error C1083: Cannot open include file: 'CL/cl.hpp': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\samples\opencl\cl\DynamicOpenCLDetection\VectorAddition\VectorAddition.cpp 20 1 VectorAddition

Error 2 error C1083: Cannot open include file: 'CL/cl.h': No such file or directory c:\users\watkinsp\amd app sdk\2.9\samples\opencl\cl\template\Template.hpp 24 1 Template

Error 3 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 BufferImageInterop

Error 4 error LNK1104: cannot open file 'OpenCL.lib' C:\Users\watkinsp\AMD APP SDK\2.9\samples\opencl\cl\DynamicOpenCLDetection\LINK DynamicOpenCLDetection

Error 5 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 TransferOverlap

Error 6 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 URNG

Error 7 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 SobelFilter

Error 8 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 SimpleImage

 

I haven't modified anything; I assumed I could just open the provided Visual Studio 2010 solution and compile one of the samples out of the box. Did I miss an installation step?

PCIE Performance


I can't understand this: "bufferbandwidth -pcie" shows me 12 GB/s in both directions, but "bufferbandwidth -if 6" shows only 6 GB/s in memset. Why? Haswell i7, Z87, Tahiti.

OpenGL-OpenCL Interoperability


Dear AMD,

 

Could you please clarify the right way to get GL-CL interop working on this system:

Windows environment (Windows 7 x64).

One or several AMD GPUs.

The primary display may not be an AMD one (integrated Intel or something else).

It should be possible to start the application over RDP.

 

My goal is to create interconnected GL-CL contexts for each GPU to build some geometry (OpenCL) and render it into an FBO (OpenGL). There is no need to display anything (except console messages; no problem there).

 

The problem is that, after spending a huge amount of time playing with OpenGL, OpenCL, and Win32, I have found no way to do it properly. Is the only way to create these contexts to make fake windows all over an infinitely stretched desktop, with a dummy plug in each card to keep them switched on? And even then I'm not quite sure it would work in all cases (RDP) of the aforementioned environment.

So, what I know for now:

1. A GL context cannot be created from a CL one; this is the most disappointing fact.

2. WGL_AMD_gpu_association: it needs one window-related GL context to generate the others, but the main problem is that those others are useless for CL context creation (CL_INVALID_SHAREGROUP_REFERENCE_KHR). I don't know the reason; this is the second disappointing fact.

3. Last chance: enumerate the Windows display devices, find an AMD one, and use CreateDC instead of GetDC(window handle). This works fine with AMD as the primary display; with it as a secondary display there is an access-violation exception in ChoosePixelFormat, and no GL context as a result.
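For context, the single-GPU path I have been testing against is the usual CL_GL_CONTEXT_KHR route; a minimal sketch (assumes a GL context is already current on the target GPU; error handling omitted):

#include <windows.h>
#include <CL/cl.h>
#include <CL/cl_gl.h>

/* Create a CL context sharing the currently bound WGL context. */
cl_context create_shared_context(cl_platform_id platform, cl_int *err)
{
    cl_context_properties props[] = {
        CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
        CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        0
    };

    /* clGetGLContextInfoKHR must be fetched at runtime. */
    clGetGLContextInfoKHR_fn getGLContextInfo =
        (clGetGLContextInfoKHR_fn)clGetExtensionFunctionAddressForPlatform(
            platform, "clGetGLContextInfoKHR");

    /* Ask which CL device actually drives this GL context. */
    cl_device_id device = NULL;
    getGLContextInfo(props, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                     sizeof device, &device, NULL);

    return clCreateContext(props, 1, &device, NULL, NULL, err);
}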

 

P.S. By the way, please look into the SimpleGL sample (and the AMD APP OpenCL Programming Guide, Appendix G - 7.1.1.1, step 8): what is ZeroMemory(&pfd, ...) there for, after the pfd initialization? And yet it works somehow!

 

Please, help.

Calculation error on GPU only


Hi,

 

Since updates to older threads don't seem to receive much attention, I'm creating this new one. I'm asking you to head over to the original thread and help me find a workaround or solution to this "behavior", which I'd call an AMD OpenCL compiler bug (until proven otherwise ;-).

 

Thanks

Cross-platform OpenCL APU Development.


I've looked at some comparisons between C++ AMP and OpenCL, and OpenCL is what I want to develop for, for many reasons: mainly that it is cross-platform and not as hardware-specific as AMP or CUDA. Well-written OpenCL code also seems to perform better than it does on AMP, while CUDA is a non-starter, being Nvidia-only.

 

Anyway, my question is how to properly develop and release OpenCL applications. I noticed there are Intel, AMD, ARM, and Nvidia SDKs. For now I'm mostly concerned with developing for AMD and Intel APUs; I'm not much interested in discrete GPUs, mostly because the kind of programs I would write require very good shared-memory performance, and APUs have it even if they don't have as fast or as many cores. I believe the APU will be the true successor to the math co-processor we have had since the days of the 386, what we call an FPU today.

 

Anyway, my concern is: what is the difference between the AMD and Intel OpenCL SDKs? Will the same OpenCL code (minus specific optimizations) run on both? I'm planning to compile my OpenCL applications as separate DLLs and call them from a VC++ application after profiling the target system: a binary produced with the AMD SDK for an AMD CPU/APU, and one produced with the Intel SDK for an Intel CPU/APU. The two "projects" would share much of the same source code while differing in their headers and kernel classes. I just want to confirm that this approach is reasonable, or whether it has been documented.
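For what it's worth, my understanding is that the vendor SDKs mainly differ in headers, samples, and tools; at runtime the ICD loader (OpenCL.dll) dispatches to whichever vendor platforms are installed, so a single binary can choose a platform dynamically rather than shipping per-vendor DLLs. A minimal sketch of that selection (vendor substrings are illustrative):

#include <CL/cl.h>
#include <string.h>

/* Enumerate installed platforms and pick one by vendor substring.
   One executable linked against the ICD can reach both AMD and Intel. */
static cl_platform_id pick_platform(const char *wanted_vendor)
{
    cl_platform_id ids[16];
    cl_uint n = 0;
    clGetPlatformIDs(16, ids, &n);
    if (n > 16) n = 16;
    for (cl_uint i = 0; i < n; ++i) {
        char vendor[256];
        clGetPlatformInfo(ids[i], CL_PLATFORM_VENDOR, sizeof vendor, vendor, NULL);
        if (strstr(vendor, wanted_vendor))  /* e.g. "Advanced Micro Devices" or "Intel" */
            return ids[i];
    }
    return n ? ids[0] : NULL;               /* fall back to the first platform */
}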

 

I would also be very interested in any books or literature on this topic. If there is anything else I should be aware of, please let me know. Thank you.

Hello world does not compile


Hello,

 

I am following this "Hello World" tutorial and it won't compile. It returns an unknown-pragma warning on this line, which seems to make the compile fail:

 

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

 

I am compiling through the "Developer Command Prompt for VS2013" using the command below, and I have copied the OpenCL.lib file into my local folder.

 

cl /Fe"hello_world.exe" /I"C:\Program Files (x86)\AMD APP SDK\2

.9\include" lesson.cpp lib/OpenCL.lib


I have copied the file I'm using verbatim below in case I misunderstood something from the tutorial. I have the complete APP SDK 2.9, an AMD A10 APU, and recent drivers. I don't know why it's having issues, unless the extension is no longer supported; this is the closest I've seen to a list of supported extensions. I'm not sure what the problem is. I've included the complete compiler output at the bottom.


Thank you if anyone can show me why it isn't working.


 

lesson.cpp:


"""

#include <utility>

#define __NO_STD_VECTOR // Use cl::vector instead of STL version

#include <CL/cl.hpp>

#include <cstdio>

#include <cstdlib>

#include <fstream>

#include <iostream>

#include <string>

#include <iterator>

 

const std::string hw("Hello World\n");

 

inline void checkErr(cl_int err, const char * name) {

    if (err != CL_SUCCESS) {

        std::cerr << "ERROR: " << name << " (" << err << ")" << std::endl;

        exit(EXIT_FAILURE);

    }

}

 

int main(void)

{

    cl_int err;

    cl::vector< cl::Platform > platformList;

    cl::Platform::get(&platformList);

    checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get");

    std::cerr << "Platform number is: " << platformList.size() << std::endl;

    std::string platformVendor;

    platformList[0].getInfo((cl_platform_info)CL_PLATFORM_VENDOR, &platformVendor);

    std::cerr << "Platform is by: " << platformVendor << "\n";

    cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};

    cl::Context context(CL_DEVICE_TYPE_CPU, cprops, NULL, NULL, &err);

    checkErr(err, "Conext::Context()");

 

    char * outH = new char[hw.length()+1];

    cl::Buffer outCL(context,

                     CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,

                     hw.length()+1,

                     outH,

                     &err);

    checkErr(err, "Buffer::Buffer()");

 

    cl::vector<cl::Device> devices;

    devices = context.getInfo<CL_CONTEXT_DEVICES>();

    checkErr(devices.size() > 0 ? CL_SUCCESS : -1, "devices.size() > 0");

 

    std::ifstream file("lesson1_kernels.cl");

    checkErr(file.is_open() ? CL_SUCCESS:-1, "lesson1_kernel.cl");

    std::string prog(std::istreambuf_iterator<char>(file),

                    (std::istreambuf_iterator<char>()));

    cl::Program::Sources source(1,

                                std::make_pair(prog.c_str(), prog.length()+1));

    cl::Program program(context, source);

    err = program.build(devices,"");

    checkErr(err, "Program::build()");

 

    cl::Kernel kernel(program, "hello", &err);

    checkErr(err, "Kernel::Kernel()");err = kernel.setArg(0, outCL);

    checkErr(err, "Kernel::setArg()");

 

    cl::CommandQueue queue(context, devices[0], 0, &err);

    checkErr(err, "CommandQueue::CommandQueue()");cl::Event event;

    err = queue.enqueueNDRangeKernel(kernel,

                                     cl::NullRange,

                                     cl::NDRange(hw.length()+1),

                                     cl::NDRange(1, 1),

                                     NULL,

                                     &event);

    checkErr(err, "ComamndQueue::enqueueNDRangeKernel()");

 

    event.wait();

    err = queue.enqueueReadBuffer(outCL,

                                  CL_TRUE,

                                  0,

                                  hw.length()+1,

                                  outH);

    checkErr(err, "ComamndQueue::enqueueReadBuffer()");

    std::cout << outH;

    return EXIT_SUCCESS;

}

 

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

__constant char hw[] = "Hello World\n";

__kernel void hello(__global char * out)

{

    size_t tid = get_global_id(0);

    out[tid] = hw[tid];

}

"""



Complete compiler output:

"""

D:\Programs\openCL>cl /Fe"hello_world.exe" /I"C:\Program Files (x86)\AMD APP SDK

\2.9\include" lesson.cpp lib/OpenCL.lib

Microsoft (R) C/C++ Optimizing Compiler Version 18.00.21005.1 for x86

Copyright (C) Microsoft Corporation.  All rights reserved.

 

lesson.cpp

C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\INCLUDE\xlocale(337) : wa

rning C4530: C++ exception handler used, but unwind semantics are not enabled. S

pecify /EHsc

lesson.cpp(81) : warning C4068: unknown pragma

lesson.cpp(82) : error C2144: syntax error : 'char' should be preceded by ';'

lesson.cpp(82) : error C4430: missing type specifier - int assumed. Note: C++ do

es not support default-int

lesson.cpp(82) : error C2373: 'hw' : redefinition; different type modifiers

        lesson.cpp(11) : see declaration of 'hw'

lesson.cpp(83) : error C2144: syntax error : 'void' should be preceded by ';'

lesson.cpp(83) : error C4430: missing type specifier - int assumed. Note: C++ do

es not support default-int

lesson.cpp(83) : error C2065: '__global' : undeclared identifier

lesson.cpp(83) : error C2144: syntax error : 'char' should be preceded by ')'

lesson.cpp(83) : error C2448: 'hello' : function-style initializer appears to be

a function definition

lesson.cpp(83) : error C2059: syntax error : ')'

lesson.cpp(85) : error C2144: syntax error : 'size_t' should be preceded by '}'

lesson.cpp(85) : error C2144: syntax error : 'size_t' should be preceded by ';'

lesson.cpp(85) : error C3861: 'get_global_id': identifier not found

lesson.cpp(86) : error C2057: expected constant expression

lesson.cpp(86) : error C2466: cannot allocate an array of constant size 0

lesson.cpp(86) : error C4430: missing type specifier - int assumed. Note: C++ do

es not support default-int

lesson.cpp(86) : error C2372: 'out' : redefinition; different types of indirecti

on

        lesson.cpp(83) : see declaration of 'out'

lesson.cpp(86) : error C2088: '[' : illegal for class

lesson.cpp(87) : error C2059: syntax error : '}'

lesson.cpp(87) : error C2143: syntax error : missing ';' before '}'

"""



 

Does the M290X fully support VCE like the desktop graphics cards do?


Hi,

 

Before I order a laptop with an M290X, I would like to confirm that VCE is supported by this card.

 

On the AMD website there is a nice table of features for the desktop R9 series, with clear Video Codec Engine (VCE) support (H.264, MPEG-4 ASP, MPEG-2, VC-1 & Blu-ray 3D).


For the mobile R9 there is no such information. I hope you can provide me with some valuable info about this.


Thanks.


Regards

Pawel.


Scratch registers - how to prevent their usage?


I use a fixed-size array in registers to reduce the fetch size required by the kernel.

 

At some size (11 elements) kernel performance drops considerably (a 3x slowdown) and 22 scratch registers are used.

 

Kernel occupancy is 25%, which corresponds to 8 waves per CU.

 

That is, instead of using only a single workgroup of 4 waves and no scratch registers, the compiler decided to keep 8 waves (2 workgroups) per CU but introduce 22 scratch registers.

Since performance drops greatly, this is obviously a bad choice.

 

At an array size of 10 there are no scratch registers at all, 8 waves, and 31 VGPRs used (I am profiling the kernel on a Loveland GPU).

At an array size of 11 there are 22 scratch registers, 31 VGPRs, and 8 waves too.

 

Is it possible to tell the compiler somehow not to use scratch registers and to decrease the number of waves in flight instead?

I expect much better performance with more register space per work-item, even if the number of waves in flight drops to only 4.
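One knob I am aware of (a sketch only; not verified to help in this case) is the reqd_work_group_size kernel attribute, which pins the workgroup size at compile time so the compiler can budget registers for a known number of in-flight waves instead of spilling. The sizes and kernel body below are illustrative, not my real kernel:

/* Hypothetical kernel: with the workgroup size pinned, the compiler may
   allocate more VGPRs per work-item rather than spill the fixed-size
   array to scratch. */
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void accumulate(__global const float *in, __global float *out)
{
    float acc[11];                      /* the fixed-size array in question */
    size_t gid = get_global_id(0);
    for (int i = 0; i < 11; ++i)
        acc[i] = in[gid * 11 + i];
    float sum = 0.0f;
    for (int i = 0; i < 11; ++i)
        sum += acc[i];
    out[gid] = sum;
}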

 

 

Here is the ISA for a length of 10:

; -------- Disassembly -------------------- 00 ALU_PUSH_BEFORE: ADDR(64) C - Pastebin.com

 

And here for a length of 11:

 

; -------- Disassembly -------------------- 00 ALU_PUSH_BEFORE: ADDR(224 - Pastebin.com

clFFT performance evaluation


I have been working on a performance evaluation of the clFFT library on a GPU, an AMD Radeon R7 260X. The CPU is an Intel Xeon and the OS is CentOS.

 

I have been studying the performance of 2D 16x16 clFFT with different batch sizes (parallel FFTs). I was surprised to see how different the results obtained from event profiling and gettimeofday are.

 

The results for the 2D 16x16 clFFT with different batch sizes are as follows.

 

Using EventProfiling

 

batch    kernel exec time (us)
   1        320.7
  16        461.1
 256        458.3
 512        537.7
1024       1016.8

 

Here, batch is the number of parallel FFTs, and the kernel execution time is in microseconds.

 

Using gettimeofday

 

batch    HtoD (us)    kernel exec time (us)    DtoH (us)
   1       29653            10850                39227
  16       28313            10786                32474
 256       26995            11167                39672
 512       26145            10773                32273
1024       26856            11948                31060

Here, batch is the number of parallel FFTs, HtoD is the host-to-device transfer time, kernel exec time is the kernel execution time, and DtoH is the device-to-host transfer time, all in microseconds.

 

(I am sorry that I can't show you the results in a nicer table format; I cannot add tables here. I hope you can still read them.) Here are my questions:

 

1a) Why are the kernel times obtained from event profiling completely different from those obtained with gettimeofday?

 

1b) The other question is: which results are correct?

 

2) The amount of data transferred grows with the batch size, but in the gettimeofday results the transfer times, both HtoD and DtoH, stay almost constant instead of growing as the batch size increases from 1 to 1024. Why is that?

 

For reference, a piece of the code is given below.

 

clFinish( cl_queue);

// Copy data from host to device
gettimeofday(&t_s_gpu1, NULL);
clEnqueueWriteBuffer( cl_queue, d_data, CL_TRUE, 0, width*height*batchSize*sizeof(cl_compl_flt), h_src, 0, NULL, &event1);
clFinish( cl_queue);
clWaitForEvents(1, &event1);
gettimeofday(&t_e_gpu1, NULL);

checkCL( clAmdFftBakePlan( fftPlan, 1, &cl_queue, NULL, NULL) );

clAmdFftSetPlanBatchSize( fftPlan, batchSize );
clFinish( cl_queue);

gettimeofday(&t_s_gpu, NULL);
checkCL( clAmdFftEnqueueTransform( fftPlan, CLFFT_FORWARD, 1, &cl_queue, 0, NULL, &event, &d_data, NULL, NULL) );
clFinish( cl_queue);
clWaitForEvents(1, &event);
gettimeofday(&t_e_gpu, NULL);

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);

totaltime=totaltime+time_end - time_start;
clFinish( cl_queue);

// Copy result from device to host

gettimeofday(&t_s_gpu2, NULL);
checkCL( clEnqueueReadBuffer(cl_queue, d_data, CL_TRUE, 0, width*height*batchSize*sizeof(cl_compl_flt), h_res, 0, NULL, &event2));
clFinish( cl_queue);
clWaitForEvents(1, &event2);
gettimeofday(&t_e_gpu2, NULL);
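For reference, the event timestamps come back as device-side nanoseconds and cover only kernel execution, not enqueue or synchronization overhead; a sketch of the conversion (reusing event from the code above):

cl_ulong t_start = 0, t_end = 0;   /* nanoseconds, device clock */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof t_start, &t_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof t_end,   &t_end,   NULL);
double kernel_us = (double)(t_end - t_start) / 1000.0;  /* device execution time in us */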

 

I look forward to your comments and answers; loads of thanks in advance.

 

Best Regards,

Sreehari.

 


Assertion error arising from clBuildProgram in my code in Visual Studio 2012


I am trying to run a program that has three (3) implementations of a raytracing algorithm (C++ recursive, C++ iterative, and OpenCL). The recursive and iterative versions both run alright; my challenge is with the OpenCL version. I was getting a lot of errors; after several modifications and tweaks I am finally left with just a few, one of which is caught when clBuildProgram is called.

 

Perhaps I am not passing the right parameters, or I am missing something minor. The thing is, the whole project builds well with no errors, only a few warnings.

I am hoping someone can point me in the right direction. Here are the specifications of the system I am using:

 

CPU: AMD E-50 processor, 1.6 GHz

OpenCL SDK : AMD APP 2.7

IDE: Visual Studio 2012

 

The error occurs on line 23 of the code excerpt below:

 



void OpenCLManager::MapKernelNameToProgram(std::string a_kernelName, std::string a_programName)
{
    // Build CPU programs
    {
        cl_int err = 0;
        std::string programName(a_programName);
        std::string kernelName(a_kernelName);
        programName += " OpenCL_CPU";
        kernelName += " OpenCL_CPU";
        cl_program program = m_MapProgramNameToProgram[programName];
        if (!program)
        {
            // Load the source code to a string.
            char* programSource = this->LoadProgramSource(a_programName.c_str());
            // Create the program with source string.
            program = clCreateProgramWithSource(m_ContextCPU, 1, (const char**)&programSource, NULL, &err);
            assert(err == CL_SUCCESS);
            // Build the program.
            err = clBuildProgram(program, 1, &m_DeviceCPU, NULL, NULL, NULL);
            char build[2048];
            clGetProgramBuildInfo(program, m_DeviceCPU, CL_PROGRAM_BUILD_LOG, 2048, build, NULL);
            printf("Build Log:\n%s\n", build); // Prints any build errors
            assert(err == CL_SUCCESS); // presumably the assertion that fires
            m_MapProgramNameToProgram[programName] = program;
        }
        m_MapKernelNameToProgram[kernelName] = program;
    }
    // Build GPU programs
    {
        cl_int err = 0;
        std::string programName(a_programName);
        std::string kernelName(a_kernelName);
        programName += " OpenCL_GPU";
        kernelName += " OpenCL_GPU";
        cl_program program = m_MapProgramNameToProgram[programName];
        if (!program)
        {
            // Build the program, since it has not yet been built.
            // Load the source code to a string.
            char* programSource = this->LoadProgramSource(a_programName.c_str());
            // Create the program with source string.
            program = clCreateProgramWithSource(m_ContextGPU, 1, (const char**)&programSource, NULL, &err);
            assert(err == CL_SUCCESS);
            // Build the program.
            //err = clBuildProgram(program, 1, &m_DeviceGPU, NULL, NULL, NULL);
            char build[2048];
            clGetProgramBuildInfo(program, m_DeviceGPU, CL_PROGRAM_BUILD_LOG, 2048, build, NULL);
            printf("Build Log:\n%s\n", build); // Prints any build errors
            assert(err == CL_SUCCESS);
            m_MapProgramNameToProgram[programName] = program;
        }
        m_MapKernelNameToProgram[kernelName] = program;
    }
}


char* OpenCLManager::LoadProgramSource(const char* filename)
{
    struct stat statbuf;
    FILE *fh;
    char *source;

    fh = fopen(filename, "r");
    if (fh == 0)
        return 0;

    stat(filename, &statbuf);
    source = (char *) malloc(statbuf.st_size + 1);
    fread(source, statbuf.st_size, 1, fh);
    source[statbuf.st_size] = '\0';
    return source;
}


//void OpenCLManager::InitContext( void ) {
// cl_int err = 0;
// // Create a context to perform our calculation with the 
// // specified device 
// m_ContextCPU = clCreateContext(0, 1, &m_DeviceCPU, NULL, NULL, &err);
// assert(err == CL_SUCCESS);
// m_ContextGPU = clCreateContext(0, 1, &m_DeviceGPU, NULL, NULL, &err);
// assert(err == CL_SUCCESS);
//
//}
//
//void OpenCLManager::InitCommandQueue() {
// m_CommandQueueCPU = clCreateCommandQueue(m_ContextCPU, m_DeviceCPU, 0, NULL);
// m_CommandQueueGPU = clCreateCommandQueue(m_ContextGPU, m_DeviceGPU, 0, NULL);
//}  // TODO: give some better way of notifying whether or not this was successful.
void OpenCLManager::InitDevice()
{
    // Init CPU device
    m_ContextCPU = NULL;
    m_ContextGPU = NULL;
    m_CommandQueueCPU = NULL;
    m_CommandQueueGPU = NULL;
    m_DeviceCPU = NULL;
    m_DeviceGPU = NULL;
    cl_platform_id platform_id = NULL;
    m_DeviceType = OpenCL_CPU;
    m_raytraceKernelWorkGroupSize = NULL;
    cl_uint ret_num_devices;
    cl_uint ret_num_platforms;
    cl_int err = 0;
    /* Get Platform and Device Info */
    err = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    // Find the CPU CL device, as a fallback
    err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1, &m_DeviceCPU, &ret_num_devices);
    std::cout << err << std::endl;
    assert(err == CL_SUCCESS);
    // Init GPU device
    // Find the GPU CL device, this is what we really want
    // If no GPU device is CL capable, fall back to CPU
    err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &m_DeviceGPU, NULL);
    if (err != CL_SUCCESS) m_DeviceGPU = m_DeviceCPU;
    assert(m_DeviceGPU);
    // Create a context to perform our calculation with the
    // specified device
    m_ContextCPU = clCreateContext(NULL, 1, &m_DeviceCPU, NULL, NULL, &err);
    assert(err == CL_SUCCESS);
    m_ContextGPU = clCreateContext(NULL, 1, &m_DeviceGPU, NULL, NULL, &err);
    assert(err == CL_SUCCESS);
    /* Create Command Queue */
    m_CommandQueueCPU = clCreateCommandQueue(m_ContextCPU, m_DeviceCPU, 0, &err);
    m_CommandQueueGPU = clCreateCommandQueue(m_ContextGPU, m_DeviceGPU, 0, &err);
    // Construct the KernelName -> Program map.
    this->MapKernelNameToProgram(RAYTRACE_KERNEL, RAY_TRACER_PROGRAM);
}
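For comparison, the build-log pattern I have seen recommended queries the log size first, so long logs are not truncated (a sketch, reusing program and m_DeviceCPU from the code above):

err = clBuildProgram(program, 1, &m_DeviceCPU, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    size_t logSize = 0;
    clGetProgramBuildInfo(program, m_DeviceCPU, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    char *log = (char *)malloc(logSize + 1);
    clGetProgramBuildInfo(program, m_DeviceCPU, CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    log[logSize] = '\0';
    printf("Build Log:\n%s\n", log);   /* full compiler diagnostics for the .cl source */
    free(log);
}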

Please also find attached the Visual Studio solution. Thank you!

Concurrent OpenCL-kernel execution and Windows/Linux driver differences (Hawaii GPU)


Hi,

 

I've written an OpenCL application which runs much faster on Linux if kernels are called concurrently on R9 290(X) devices, because only one of the three kernels has high register usage. The performance of GCN-based devices (excluding Hawaii) scales very well with CU count and GPU core frequency on Windows and Linux. Here are my problem and my questions:

  1. Hawaii devices show a significant performance drop on Windows compared to Linux (about a third of the performance is lost) because it seems impossible to execute kernels in parallel on the device. Is this caused by different feature sets of the Windows and Linux Catalyst drivers?
  2. If a monitor is attached to the GPU and kernels are called concurrently on Windows, high CPU usage is observed (on any GCN-based device). The CPU does nothing but call OpenCL kernels and read/write a few bytes of device memory. Is this a driver bug?

 

with best regards,

NaN

Finding minimum of square of difference between two arrays


Hi,

I have been trying to execute a simple kernel, but it returns garbage values and I am unable to figure out why. I want to find the closest set of planes to a given plane set using the angles between the planes, so the criterion is to minimize the square of the difference of the corresponding angles. The correct answer should be the planes with the most similar orientation. I get the desired answer on the CPU, but the kernel returns a different answer that is not consistent with my calculations.

 

__kernel void getTransformation(__global uint* permut1,
                                __global float2* dot1, __global int4* combo1,
                                __global float2* dot2, __global int4* combo2,
                                int size1, int size2,
                                __global float4* trans)
{
    int gid = get_global_id(0);
    float2 temp_dot;
    float min_dot = FLT_MAX;
    int ind = 0;
    for (int i = 0; i < size2; i++)
    {
        // Note: without a (float2) cast, the parentheses below are the C comma
        // operator, so only the second expression is assigned (splatted).
        temp_dot = (dot2[i].x - dot1[permut1[gid]].x, dot2[i].y - dot1[permut1[gid]].y);
        if ((temp_dot.x*temp_dot.x + temp_dot.y*temp_dot.y) < min_dot)
        {
            min_dot = temp_dot.x*temp_dot.x + temp_dot.y*temp_dot.y;
            ind = i;
        }
    }
    float4 num_pl2 = combo2[ind];
    trans[gid] = convert_float4_rtp(num_pl2);
}
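If the intent on the temp_dot line was a component-wise difference, a sketch of the two equivalent ways to write it (an OpenCL vector literal needs the explicit cast):

/* Vector literal with the (float2) cast... */
temp_dot = (float2)(dot2[i].x - dot1[permut1[gid]].x,
                    dot2[i].y - dot1[permut1[gid]].y);

/* ...or simply component-wise vector subtraction: */
temp_dot = dot2[i] - dot1[permut1[gid]];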

Why does increasing the number of kernel arguments impact performance?


GPU: 7970

OS: Kubuntu 12.04 x64

Driver: Catalyst 13.4

 

Hi,

 

I know this sounds crazy, but it is actually true: increasing the number of kernel arguments beyond a certain count causes performance to drop. In one of my kernels, moving from 6 to 7 kernel arguments drops performance by 30%, even though I don't do any computation with the new argument. I can't reduce the number of arguments by assembling them into a struct, because all of them are dynamically allocated arrays of variable length. Is there a way to avoid this problem?

 

Regards,

Sayantan 

GPU programming in OpenCL


Hi guys,

 

I bought an ATI Radeon R9 290X GPU but found out it is not supported by MATLAB, so I must either change it to a GeForce or learn how to use it with OpenCL.

 

Could anyone show me an example of how to program in OpenCL to make use of the GPU's computing power? My program is in MATLAB, and it is all computation, with no graphics.
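For anyone else starting out, the smallest complete OpenCL host program has roughly this shape; a sketch of the canonical vector add (most error checking omitted):

#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Compile the kernel source at runtime. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Device buffers, initialized from host arrays. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    size_t gsz = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    printf("c[10] = %f\n", c[10]);  /* expect 30.0 */
    return 0;
}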


Why only 256 work-items per workgroup on ATI GPUs?


For CPU devices, AMD supports 1024 work-items per workgroup.

NV has supported 1024 for GPUs from the very beginning...

AMD has the same or a larger amount of shared memory and the same or a larger register file, so why this limitation?

Why only 4 waves per workgroup? If one needs to share the whole LDS, one is limited to only 4 wavefronts per CU no matter how many registers remain. But more waves in flight would result in better CU usage and latency hiding. So this 256 limitation looks quite artificial and not good for performance in some cases.

 

Hence the question: why was this particular size of 256 chosen? Are the reasons still relevant for new GPUs, or could they have a bigger workgroup size?
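For reference, the limit is queryable per device and per kernel; the kernel-specific value can be lower than the device maximum when register use is high (a sketch; device and kernel are assumed to be already created):

size_t dev_max = 0, krn_max = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof dev_max, &dev_max, NULL);           /* 256 on many AMD GPUs */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof krn_max, &krn_max, NULL);  /* per-kernel limit */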

Bug in clEnqueueTask?


I was using CodeXL to do some profiling and it kept saying:

 

Opencl Memory leak detected [Ref =1] Object created by clEnqueueNDRangeKernel

 

Eventually it was narrowed down to this line:

 

status = clEnqueueTask( myQueue, myKernel, 0, NULL, NULL) ;

 

After changing it to:

 

status = clEnqueueTask( myQueue, myKernel, 0, NULL, &clEvent ) ;
clReleaseEvent ( clEvent ) ;

 

The memory leak went away.

 

As per the spec, when passing NULL no event object should be created, but one clearly was in this case.

 

I tried it on a 6950 and a 280X, both using the 13.12 Catalyst driver, and they both have this issue.

 

Edit:

 

clEnqueueNDRangeKernel( myQueue, myKernel, workDims, global_work_offset,(const size_t *)global_work_size, local_work_size, 0, NULL, NULL) ;

 

If I use this instead of clEnqueueTask, it also runs into the same memory-leak problem, which can be solved the same way.

How does the GPU handle its code?


I mean: how is the kernel binary passed to the GPU (at context creation, before kernel launch, and so forth)?

Where is it stored? (A CPU uses ordinary system memory for code storage, but has a separate L1 instruction cache to speed up instruction fetching. What about a GPU? Is the kernel binary stored in GPU global memory or in a special limited-size buffer? How are instructions fetched? Do they go through the common data cache, through the constant cache, or through something special?)

How big can the performance impact of code bloat be? (If one uses a few different specialized kernels for similar work instead of one slower/more complex universal kernel, how will the increase in the total number of kernels and in total binary size affect performance?)

 

There is information about how the GPU handles data memory, but not much, if any, about how it handles code. Worth discussing?
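As a small data point, the size of the compiled binary is at least observable from the host (a sketch for a single-device program; program is assumed already built):

size_t bin_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof bin_size, &bin_size, NULL);
printf("kernel binary: %lu bytes\n", (unsigned long)bin_size);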

clBuildProgram crash


This kernel runs fine on my computer on a CPU device, and the CodeXL Kernel Analyzer verifies it should work on a wide variety of cards: everything HD5k or newer, for image support. Unfortunately I'm currently stuck with an HD4870 and therefore cannot test it myself. On my roommate's computer, however, clBuildProgram reports an unexpected exception, both in my own host application and in the Kernel Analyzer. We installed new drivers, the APP SDK, and CodeXL, but nothing seems to work. Is this a driver bug, or is the code actually somehow wrong?

 


constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

typedef struct {
    float4 orig;
    float4 dir;
} CLRay;

typedef struct {
    float2 uv;
    float t;
    int id;
} CLResult;

typedef struct {
    float4 a1, a2, a3;
} CLTriangle;

typedef struct {
    float4 nor, col;
} CLTriangledata;

// Note: img is both read and written below; in OpenCL 1.x an image argument
// must be read_only (the default) or write_only, not both.
kernel void fill_image(image2d_t img,
                       constant CLRay * ray,
                       constant CLResult * result,
                       global CLTriangledata * triangledata,
                       float N)
{
    const int2 id = (int2){get_global_id(0), get_global_id(1)};
    const int id1d = id.y*get_global_size(0) + id.x;
    int res = result[id1d].id;
    float4 col = (float4).0f;
    if (res >= 0)
        col = pow(triangledata[res].col, 2.2f) * fabs(dot(ray[id1d].dir, triangledata[res].nor));
    float4 oldcol = read_imagef(img, sampler, id);
    write_imagef(img, id, (oldcol*(N-1.0f)+col)/N);
}

Possible bug in read_imagef


Strange behavior: the kernel works fine with this sampler:

__constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;

using read_imagef(image, imageSampler, coords);

but!..

without a sampler, using "read_imagef(image, coords)" (which should work the same way as the above), the data comes back randomly corrupted.

The image is a 3D image, referenced from OpenGL, rendered there slice by slice through an FBO.

Seems like a bug?

P.S. Another, regular 2D image works fine with or without the sampler.
