Channel: Community : All Content - OpenCL

x264 demo not working on AMD platform


Hi,

 

I downloaded an OpenCL-enabled x264 encoder from Universität Heidelberg. It compiled fine on the AMD platform, but it fails to run on Ubuntu 13.10 with an HD 7970: segmentation fault.

Sharing the compiled binaries here:

 

Command to run the demo: # ./x264 --threads 8 -A none --no-cabac --no-deblock --subme 0 --me dia --qp 16 --output out.264 result.y4m

 

Change the x264 executable's permissions if required.

 

Can someone help me with this?


OpenCL Samples Visual Studio Compile Errors


I just downloaded the AMD OpenCL SDK and tried compiling one of the samples in Visual Studio 2010. These are the errors I get:

Error 1 error C1083: Cannot open include file: 'CL/cl.hpp': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\samples\opencl\cl\DynamicOpenCLDetection\VectorAddition\VectorAddition.cpp 20 1 VectorAddition

Error 2 error C1083: Cannot open include file: 'CL/cl.h': No such file or directory c:\users\watkinsp\amd app sdk\2.9\samples\opencl\cl\template\Template.hpp 24 1 Template

Error 3 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 BufferImageInterop

Error 4 error LNK1104: cannot open file 'OpenCL.lib' C:\Users\watkinsp\AMD APP SDK\2.9\samples\opencl\cl\DynamicOpenCLDetection\LINK DynamicOpenCLDetection

Error 5 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 TransferOverlap

Error 6 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 URNG

Error 7 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 SobelFilter

Error 8 error C1083: Cannot open include file: 'CL/opencl.h': No such file or directory C:\Users\watkinsp\AMD APP SDK\2.9\include\SDKUtil\CLUtil.hpp 24 1 SimpleImage

 

I haven't modified anything; I assumed I could just open the provided Visual Studio 2010 solution and compile one of the samples out of the box. Did I miss an installation step?

PCIE Performance


I can't understand this: "bufferbandwidth -pcie" shows me 12 GB/s in both directions, but "bufferbandwidth -if 6" shows only 6 GB/s in memset. Why? Haswell i7, Z87, Tahiti.

OpenGL-OpenCL Interoperability


Dear AMD,

 

Could you please clarify the right way to get GL-CL interop working on this system:

Windows environment (Windows 7 x64).

One or several AMD GPUs.

The primary display may not be an AMD one (integrated Intel or something else).

It should be possible to start the application over RDP.

 

My goal is to create interconnected GL-CL contexts for each GPU to build some geometry (OpenCL) and render it into an FBO (OpenGL). There is no need to display anything (except console messages; no problem there).

 

The problem is that, after spending a huge amount of time playing with OpenGL, OpenCL, and Win32, I have found no way to do it properly. Is the only way to create these contexts to make fake windows all over an infinitely stretched desktop, with a dummy plug in each card to keep them switched on? And even then I'm not quite sure it would work in all cases (RDP) of the aforementioned environment.

So, what I know for now:

1. A GL context cannot be created from a CL one; this is the most disappointing fact.

2. WGL_AMD_gpu_association: it needs one window-related GL context to generate the others, but the main problem is that those others are useless for CL context creation (CL_INVALID_SHAREGROUP_REFERENCE_KHR). I don't know the reason; this is the second disappointing fact.

3. Last chance: enumerate the Windows display devices, find an AMD one, and use CreateDC instead of GetDC(window handle). This works fine with AMD as the primary display; with it as a secondary display there is an access-violation exception in ChoosePixelFormat, and no GL context as a result.
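For context, the single-GPU path I have been testing against is the usual CL_GL_CONTEXT_KHR route; a minimal sketch (assumes a GL context is already current on the target GPU; error handling omitted):

#include <windows.h>
#include <CL/cl.h>
#include <CL/cl_gl.h>

/* Create a CL context sharing the currently bound WGL context. */
cl_context create_shared_context(cl_platform_id platform, cl_int *err)
{
    cl_context_properties props[] = {
        CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
        CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        0
    };

    /* clGetGLContextInfoKHR must be fetched at runtime. */
    clGetGLContextInfoKHR_fn getGLContextInfo =
        (clGetGLContextInfoKHR_fn)clGetExtensionFunctionAddressForPlatform(
            platform, "clGetGLContextInfoKHR");

    /* Ask which CL device actually drives this GL context. */
    cl_device_id device = NULL;
    getGLContextInfo(props, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                     sizeof device, &device, NULL);

    return clCreateContext(props, 1, &device, NULL, NULL, err);
}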

 

P.S. By the way, please look into the SimpleGL sample (and the AMD APP OpenCL Programming Guide, Appendix G - 7.1.1.1, step 8): what is ZeroMemory(&pfd, ...) there for, after the pfd initialization? And yet it works somehow!

 

Please, help.

Calculation error on GPU only


Hi,

 

Since updates to older threads don't seem to receive much attention, I'm creating this new one. I'm asking you to head over to the original thread and help me find a workaround or solution to this "behavior", which I'd call an AMD OpenCL compiler bug (until proven otherwise ;-).

 

Thanks

Cross-platform OpenCL APU Development.


I've looked at some comparisons between C++ AMP and OpenCL, and OpenCL is what I want to develop for, for many reasons: mainly that it is cross-platform and not as hardware-specific as AMP or CUDA. Well-written OpenCL code also seems to perform better than it does on AMP, while CUDA is a non-starter, being Nvidia-only.

 

Anyway, my question is how to properly develop and release OpenCL applications. I noticed there are Intel, AMD, ARM, and Nvidia SDKs. For now I'm mostly concerned with developing for AMD and Intel APUs; I'm not much interested in discrete GPUs, mostly because the kind of programs I would write require very good shared-memory performance, and APUs have it even if they don't have as fast or as many cores. I believe the APU will be the true successor to the math co-processor we have had since the days of the 386, what we call an FPU today.

 

Anyway, my concern is: what is the difference between the AMD and Intel OpenCL SDKs? Will the same OpenCL code (minus specific optimizations) run on both? I'm planning to compile my OpenCL applications as separate DLLs and call them from a VC++ application after profiling the target system: a binary produced with the AMD SDK for an AMD CPU/APU, and one produced with the Intel SDK for an Intel CPU/APU. The two "projects" would share much of the same source code while differing in their headers and kernel classes. I just want to confirm that this approach is reasonable, or whether it has been documented.
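For what it's worth, my understanding is that the vendor SDKs mainly differ in headers, samples, and tools; at runtime the ICD loader (OpenCL.dll) dispatches to whichever vendor platforms are installed, so a single binary can choose a platform dynamically rather than shipping per-vendor DLLs. A minimal sketch of that selection (vendor substrings are illustrative):

#include <CL/cl.h>
#include <string.h>

/* Enumerate installed platforms and pick one by vendor substring.
   One executable linked against the ICD can reach both AMD and Intel. */
static cl_platform_id pick_platform(const char *wanted_vendor)
{
    cl_platform_id ids[16];
    cl_uint n = 0;
    clGetPlatformIDs(16, ids, &n);
    if (n > 16) n = 16;
    for (cl_uint i = 0; i < n; ++i) {
        char vendor[256];
        clGetPlatformInfo(ids[i], CL_PLATFORM_VENDOR, sizeof vendor, vendor, NULL);
        if (strstr(vendor, wanted_vendor))  /* e.g. "Advanced Micro Devices" or "Intel" */
            return ids[i];
    }
    return n ? ids[0] : NULL;               /* fall back to the first platform */
}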

 

I would also be very interested in any books or literature on this topic. If there is anything else I should be aware of, please let me know. Thank you.

Hello world does not compile


Hello,

 

I am following this "Hello World" tutorial and it won't compile. It returns an unknown-pragma warning on this line, which seems to make the compile fail:

 

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

 

I am compiling through the "Developer Command Prompt for VS2013" using the command below, and I have copied the OpenCL.lib file into my local folder.

 

cl /Fe"hello_world.exe" /I"C:\Program Files (x86)\AMD APP SDK\2

.9\include" lesson.cpp lib/OpenCL.lib


I have copied the file I'm using verbatim below in case I misunderstood something from the tutorial. I have the complete APP SDK 2.9, an AMD A10 APU, and recent drivers. I don't know why it's having issues, unless the extension is no longer supported; this is the closest I've seen to a list of supported extensions. I'm not sure what the problem is. I've included the complete compiler output at the bottom.


Thank you if anyone can show me why it isn't working.


 

lesson.cpp:


"""

#include <utility>

#define __NO_STD_VECTOR // Use cl::vector instead of STL version

#include <CL/cl.hpp>

#include <cstdio>

#include <cstdlib>

#include <fstream>

#include <iostream>

#include <string>

#include <iterator>

 

const std::string hw("Hello World\n");

 

inline void checkErr(cl_int err, const char * name) {

    if (err != CL_SUCCESS) {

        std::cerr << "ERROR: " << name << " (" << err << ")" << std::endl;

        exit(EXIT_FAILURE);

    }

}

 

int main(void)

{

    cl_int err;

    cl::vector< cl::Platform > platformList;

    cl::Platform::get(&platformList);

    checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get");

    std::cerr << "Platform number is: " << platformList.size() << std::endl;

    std::string platformVendor;

    platformList[0].getInfo((cl_platform_info)CL_PLATFORM_VENDOR, &platformVendor);

    std::cerr << "Platform is by: " << platformVendor << "\n";

    cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};

    cl::Context context(CL_DEVICE_TYPE_CPU, cprops, NULL, NULL, &err);

    checkErr(err, "Conext::Context()");

 

    char * outH = new char[hw.length()+1];

    cl::Buffer outCL(context,

                     CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,

                     hw.length()+1,

                     outH,

                     &err);

    checkErr(err, "Buffer::Buffer()");

 

    cl::vector<cl::Device> devices;

    devices = context.getInfo<CL_CONTEXT_DEVICES>();

    checkErr(devices.size() > 0 ? CL_SUCCESS : -1, "devices.size() > 0");

 

    std::ifstream file("lesson1_kernels.cl");

    checkErr(file.is_open() ? CL_SUCCESS:-1, "lesson1_kernel.cl");

    std::string prog(std::istreambuf_iterator<char>(file),

                    (std::istreambuf_iterator<char>()));

    cl::Program::Sources source(1,

                                std::make_pair(prog.c_str(), prog.length()+1));

    cl::Program program(context, source);

    err = program.build(devices,"");

    checkErr(err, "Program::build()");

 

    cl::Kernel kernel(program, "hello", &err);

    checkErr(err, "Kernel::Kernel()");err = kernel.setArg(0, outCL);

    checkErr(err, "Kernel::setArg()");

 

    cl::CommandQueue queue(context, devices[0], 0, &err);

    checkErr(err, "CommandQueue::CommandQueue()");cl::Event event;

    err = queue.enqueueNDRangeKernel(kernel,

                                     cl::NullRange,

                                     cl::NDRange(hw.length()+1),

                                     cl::NDRange(1, 1),

                                     NULL,

                                     &event);

    checkErr(err, "ComamndQueue::enqueueNDRangeKernel()");

 

    event.wait();

    err = queue.enqueueReadBuffer(outCL,

                                  CL_TRUE,

                                  0,

                                  hw.length()+1,

                                  outH);

    checkErr(err, "ComamndQueue::enqueueReadBuffer()");

    std::cout << outH;

    return EXIT_SUCCESS;

}

 

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

__constant char hw[] = "Hello World\n";

__kernel void hello(__global char * out)

{

    size_t tid = get_global_id(0);

    out[tid] = hw[tid];

}

"""



Complete compiler output:

"""

D:\Programs\openCL>cl /Fe"hello_world.exe" /I"C:\Program Files (x86)\AMD APP SDK

\2.9\include" lesson.cpp lib/OpenCL.lib

Microsoft (R) C/C++ Optimizing Compiler Version 18.00.21005.1 for x86

Copyright (C) Microsoft Corporation.  All rights reserved.

 

lesson.cpp

C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\INCLUDE\xlocale(337) : wa

rning C4530: C++ exception handler used, but unwind semantics are not enabled. S

pecify /EHsc

lesson.cpp(81) : warning C4068: unknown pragma

lesson.cpp(82) : error C2144: syntax error : 'char' should be preceded by ';'

lesson.cpp(82) : error C4430: missing type specifier - int assumed. Note: C++ do

es not support default-int

lesson.cpp(82) : error C2373: 'hw' : redefinition; different type modifiers

        lesson.cpp(11) : see declaration of 'hw'

lesson.cpp(83) : error C2144: syntax error : 'void' should be preceded by ';'

lesson.cpp(83) : error C4430: missing type specifier - int assumed. Note: C++ do

es not support default-int

lesson.cpp(83) : error C2065: '__global' : undeclared identifier

lesson.cpp(83) : error C2144: syntax error : 'char' should be preceded by ')'

lesson.cpp(83) : error C2448: 'hello' : function-style initializer appears to be

a function definition

lesson.cpp(83) : error C2059: syntax error : ')'

lesson.cpp(85) : error C2144: syntax error : 'size_t' should be preceded by '}'

lesson.cpp(85) : error C2144: syntax error : 'size_t' should be preceded by ';'

lesson.cpp(85) : error C3861: 'get_global_id': identifier not found

lesson.cpp(86) : error C2057: expected constant expression

lesson.cpp(86) : error C2466: cannot allocate an array of constant size 0

lesson.cpp(86) : error C4430: missing type specifier - int assumed. Note: C++ do

es not support default-int

lesson.cpp(86) : error C2372: 'out' : redefinition; different types of indirecti

on

        lesson.cpp(83) : see declaration of 'out'

lesson.cpp(86) : error C2088: '[' : illegal for class

lesson.cpp(87) : error C2059: syntax error : '}'

lesson.cpp(87) : error C2143: syntax error : missing ';' before '}'

"""



 

Does the M290X fully support VCE like the desktop graphics cards do?


Hi,

 

Before I order a laptop with an M290X, I would like to confirm that VCE is supported by this card.

 

On the AMD website there is a nice table of features for the desktop R9 series, with clear Video Codec Engine (VCE) support (H.264, MPEG-4 ASP, MPEG-2, VC-1 & Blu-ray 3D).


For the mobile R9 there is no such information. I hope you can provide me with some valuable info about this.


Thanks.


Regards

Pawel.


Scratch registers - how to prevent their usage?


I use a fixed-size array in registers to reduce the fetch size required by the kernel.

 

At some size (11 elements) kernel performance drops considerably (a 3x slowdown) and 22 scratch registers are used.

 

Kernel occupancy is 25%, which corresponds to 8 waves per CU.

 

That is, instead of using only a single workgroup of 4 waves and no scratch registers, the compiler decided to keep 8 waves (2 workgroups) per CU but introduce 22 scratch registers.

Since performance drops greatly, this is obviously a bad choice.

 

At an array size of 10 there are no scratch registers at all, 8 waves, and 31 VGPRs used (I am profiling the kernel on a Loveland GPU).

At an array size of 11 there are 22 scratch registers, 31 VGPRs, and 8 waves too.

 

Is it possible to tell the compiler somehow not to use scratch registers and to decrease the number of waves in flight instead?

I expect much better performance with more register space per work-item, even if the number of waves in flight drops to only 4.
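One knob I am aware of (a sketch only; not verified to help in this case) is the reqd_work_group_size kernel attribute, which pins the workgroup size at compile time so the compiler can budget registers for a known number of in-flight waves instead of spilling. The sizes and kernel body below are illustrative, not my real kernel:

/* Hypothetical kernel: with the workgroup size pinned, the compiler may
   allocate more VGPRs per work-item rather than spill the fixed-size
   array to scratch. */
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void accumulate(__global const float *in, __global float *out)
{
    float acc[11];                      /* the fixed-size array in question */
    size_t gid = get_global_id(0);
    for (int i = 0; i < 11; ++i)
        acc[i] = in[gid * 11 + i];
    float sum = 0.0f;
    for (int i = 0; i < 11; ++i)
        sum += acc[i];
    out[gid] = sum;
}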

 

 

Here is the ISA for a length of 10:

; -------- Disassembly -------------------- 00 ALU_PUSH_BEFORE: ADDR(64) C - Pastebin.com

 

And here for a length of 11:

 

; -------- Disassembly -------------------- 00 ALU_PUSH_BEFORE: ADDR(224 - Pastebin.com

clFFT performance evaluation


I have been working on a performance evaluation of the clFFT library on a GPU, an AMD Radeon R7 260X. The CPU is an Intel Xeon and the OS is CentOS.

 

I have been studying the performance of 2D 16x16 clFFT with different batch sizes (parallel FFTs). I was surprised to see how different the results obtained from event profiling and gettimeofday are.

 

The results for the 2D 16x16 clFFT with different batch sizes are as follows.

 

Using EventProfiling

 

batch    kernel exec time (us)
   1        320.7
  16        461.1
 256        458.3
 512        537.7
1024       1016.8

 

Here, batch is the number of parallel FFTs, and the kernel execution time is in microseconds.

 

Using gettimeofday

 

batch    HtoD (us)    kernel exec time (us)    DtoH (us)
   1       29653            10850                39227
  16       28313            10786                32474
 256       26995            11167                39672
 512       26145            10773                32273
1024       26856            11948                31060

Here, batch is the number of parallel FFTs, HtoD is the host-to-device transfer time, kernel exec time is the kernel execution time, and DtoH is the device-to-host transfer time, all in microseconds.

 

(I am sorry that I can't show you the results in a nicer table format; I cannot add tables here. I hope you can still read them.) Here are my questions:

 

1a) Why are the kernel times obtained from event profiling completely different from those obtained with gettimeofday?

 

1b) The other question is: which results are correct?

 

2) The amount of data transferred grows with the batch size, but in the gettimeofday results the transfer times, both HtoD and DtoH, stay almost constant instead of growing as the batch size increases from 1 to 1024. Why is that?

 

For reference, a piece of the code is given below.

 

clFinish( cl_queue);

// Copy data from host to device
gettimeofday(&t_s_gpu1, NULL);
clEnqueueWriteBuffer( cl_queue, d_data, CL_TRUE, 0, width*height*batchSize*sizeof(cl_compl_flt), h_src, 0, NULL, &event1);
clFinish( cl_queue);
clWaitForEvents(1, &event1);
gettimeofday(&t_e_gpu1, NULL);

checkCL( clAmdFftBakePlan( fftPlan, 1, &cl_queue, NULL, NULL) );

clAmdFftSetPlanBatchSize( fftPlan, batchSize );
clFinish( cl_queue);

gettimeofday(&t_s_gpu, NULL);
checkCL( clAmdFftEnqueueTransform( fftPlan, CLFFT_FORWARD, 1, &cl_queue, 0, NULL, &event, &d_data, NULL, NULL) );
clFinish( cl_queue);
clWaitForEvents(1, &event);
gettimeofday(&t_e_gpu, NULL);

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);

totaltime=totaltime+time_end - time_start;
clFinish( cl_queue);

// Copy result from device to host

gettimeofday(&t_s_gpu2, NULL);
checkCL( clEnqueueReadBuffer(cl_queue, d_data, CL_TRUE, 0, width*height*batchSize*sizeof(cl_compl_flt), h_res, 0, NULL, &event2));
clFinish( cl_queue);
clWaitForEvents(1, &event2);
gettimeofday(&t_e_gpu2, NULL);
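For reference, the event timestamps come back as device-side nanoseconds and cover only kernel execution, not enqueue or synchronization overhead; a sketch of the conversion (reusing event from the code above):

cl_ulong t_start = 0, t_end = 0;   /* nanoseconds, device clock */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof t_start, &t_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof t_end,   &t_end,   NULL);
double kernel_us = (double)(t_end - t_start) / 1000.0;  /* device execution time in us */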

 

I look forward to your comments and answers; loads of thanks in advance.

 

Best Regards,

Sreehari.

 


Assertion error arising from clBuildProgram in my code in Visual Studio 2012


I am trying to run a program that has three (3) implementations of a raytracing algorithm (C++ recursive, C++ iterative, and OpenCL). The recursive and iterative versions both run alright; my challenge is with the OpenCL version. I was getting a lot of errors; after several modifications and tweaks I am finally left with just a few, one of which is caught when clBuildProgram is called.

 

Perhaps I am not passing the right parameters, or I am missing something minor. The thing is, the whole project builds well with no errors, only a few warnings.

I am hoping someone can point me in the right direction. Here are the specifications of the system I am using:

 

CPU: AMD E-50 processor, 1.6 GHz

OpenCL SDK : AMD APP 2.7

IDE: Visual Studio 2012

 

The error occurs on line 23 of the code excerpt below:

 



void OpenCLManager::MapKernelNameToProgram(std::string a_kernelName, std::string a_programName)
{
    // Build CPU programs
    {
        cl_int err = 0;
        std::string programName(a_programName);
        std::string kernelName(a_kernelName);
        programName += " OpenCL_CPU";
        kernelName += " OpenCL_CPU";
        cl_program program = m_MapProgramNameToProgram[programName];
        if (!program)
        {
            // Load the source code to a string.
            char* programSource = this->LoadProgramSource(a_programName.c_str());
            // Create the program with source string.
            program = clCreateProgramWithSource(m_ContextCPU, 1, (const char**)&programSource, NULL, &err);
            assert(err == CL_SUCCESS);
            // Build the program.
            err = clBuildProgram(program, 1, &m_DeviceCPU, NULL, NULL, NULL);
            char build[2048];
            clGetProgramBuildInfo(program, m_DeviceCPU, CL_PROGRAM_BUILD_LOG, 2048, build, NULL);
            printf("Build Log:\n%s\n", build); // Prints any build errors
            assert(err == CL_SUCCESS); // presumably the assertion that fires
            m_MapProgramNameToProgram[programName] = program;
        }
        m_MapKernelNameToProgram[kernelName] = program;
    }
    // Build GPU programs
    {
        cl_int err = 0;
        std::string programName(a_programName);
        std::string kernelName(a_kernelName);
        programName += " OpenCL_GPU";
        kernelName += " OpenCL_GPU";
        cl_program program = m_MapProgramNameToProgram[programName];
        if (!program)
        {
            // Build the program, since it has not yet been built.
            // Load the source code to a string.
            char* programSource = this->LoadProgramSource(a_programName.c_str());
            // Create the program with source string.
            program = clCreateProgramWithSource(m_ContextGPU, 1, (const char**)&programSource, NULL, &err);
            assert(err == CL_SUCCESS);
            // Build the program.
            //err = clBuildProgram(program, 1, &m_DeviceGPU, NULL, NULL, NULL);
            char build[2048];
            clGetProgramBuildInfo(program, m_DeviceGPU, CL_PROGRAM_BUILD_LOG, 2048, build, NULL);
            printf("Build Log:\n%s\n", build); // Prints any build errors
            assert(err == CL_SUCCESS);
            m_MapProgramNameToProgram[programName] = program;
        }
        m_MapKernelNameToProgram[kernelName] = program;
    }
}


char* OpenCLManager::LoadProgramSource(const char* filename)
{
    struct stat statbuf;
    FILE *fh;
    char *source;

    fh = fopen(filename, "r");
    if (fh == 0)
        return 0;

    stat(filename, &statbuf);
    source = (char *) malloc(statbuf.st_size + 1);
    fread(source, statbuf.st_size, 1, fh);
    source[statbuf.st_size] = '\0';
    return source;
}


//void OpenCLManager::InitContext( void ) {
// cl_int err = 0;
// // Create a context to perform our calculation with the 
// // specified device 
// m_ContextCPU = clCreateContext(0, 1, &m_DeviceCPU, NULL, NULL, &err);
// assert(err == CL_SUCCESS);
// m_ContextGPU = clCreateContext(0, 1, &m_DeviceGPU, NULL, NULL, &err);
// assert(err == CL_SUCCESS);
//
//}
//
//void OpenCLManager::InitCommandQueue() {
// m_CommandQueueCPU = clCreateCommandQueue(m_ContextCPU, m_DeviceCPU, 0, NULL);
// m_CommandQueueGPU = clCreateCommandQueue(m_ContextGPU, m_DeviceGPU, 0, NULL);
//}  // TODO: give some better way of notifying whether or not this was successful.
void OpenCLManager::InitDevice()
{
    // Init CPU device
    m_ContextCPU = NULL;
    m_ContextGPU = NULL;
    m_CommandQueueCPU = NULL;
    m_CommandQueueGPU = NULL;
    m_DeviceCPU = NULL;
    m_DeviceGPU = NULL;
    cl_platform_id platform_id = NULL;
    m_DeviceType = OpenCL_CPU;
    m_raytraceKernelWorkGroupSize = NULL;
    cl_uint ret_num_devices;
    cl_uint ret_num_platforms;
    cl_int err = 0;
    /* Get Platform and Device Info */
    err = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    // Find the CPU CL device, as a fallback
    err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1, &m_DeviceCPU, &ret_num_devices);
    std::cout << err << std::endl;
    assert(err == CL_SUCCESS);
    // Init GPU device
    // Find the GPU CL device, this is what we really want
    // If no GPU device is CL capable, fall back to CPU
    err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &m_DeviceGPU, NULL);
    if (err != CL_SUCCESS) m_DeviceGPU = m_DeviceCPU;
    assert(m_DeviceGPU);
    // Create a context to perform our calculation with the
    // specified device
    m_ContextCPU = clCreateContext(NULL, 1, &m_DeviceCPU, NULL, NULL, &err);
    assert(err == CL_SUCCESS);
    m_ContextGPU = clCreateContext(NULL, 1, &m_DeviceGPU, NULL, NULL, &err);
    assert(err == CL_SUCCESS);
    /* Create Command Queue */
    m_CommandQueueCPU = clCreateCommandQueue(m_ContextCPU, m_DeviceCPU, 0, &err);
    m_CommandQueueGPU = clCreateCommandQueue(m_ContextGPU, m_DeviceGPU, 0, &err);
    // Construct the KernelName -> Program map.
    this->MapKernelNameToProgram(RAYTRACE_KERNEL, RAY_TRACER_PROGRAM);
}
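For comparison, the build-log pattern I have seen recommended queries the log size first, so long logs are not truncated (a sketch, reusing program and m_DeviceCPU from the code above):

err = clBuildProgram(program, 1, &m_DeviceCPU, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    size_t logSize = 0;
    clGetProgramBuildInfo(program, m_DeviceCPU, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    char *log = (char *)malloc(logSize + 1);
    clGetProgramBuildInfo(program, m_DeviceCPU, CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    log[logSize] = '\0';
    printf("Build Log:\n%s\n", log);   /* full compiler diagnostics for the .cl source */
    free(log);
}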

Please also find attached the Visual Studio solution. Thank you!

Concurrent OpenCL-kernel execution and Windows/Linux driver differences (Hawaii GPU)


Hi,

 

I've written an OpenCL application which runs much faster on Linux if kernels are called concurrently on R9 290(X) devices, because only one of the three kernels has high register usage. The performance of GCN-based devices (excluding Hawaii) scales very well with CU count and GPU core frequency on Windows and Linux. Here are my problem and my questions:

  1. Hawaii devices show a significant performance drop on Windows compared to Linux (about a third of the performance is lost) because it seems impossible to execute kernels in parallel on the device. Is this caused by different feature sets of the Windows and Linux Catalyst drivers?
  2. If a monitor is attached to the GPU and kernels are called concurrently on Windows, high CPU usage is observed (on any GCN-based device). The CPU does nothing but call OpenCL kernels and read/write a few bytes of device memory. Is this a driver bug?

 

with best regards,

NaN

Finding minimum of square of difference between two arrays


Hi,

I have been trying to execute a simple kernel, but it returns garbage values and I am unable to figure out why. I want to find the closest set of planes to a given plane set using the angles between the planes, so the criterion is to minimize the square of the difference of the corresponding angles. The correct answer should be the planes with the most similar orientation. I get the desired answer on the CPU, but the kernel returns a different answer that is not consistent with my calculations.

 

__kernel void getTransformation(__global uint* permut1,
                                __global float2* dot1, __global int4* combo1,
                                __global float2* dot2, __global int4* combo2,
                                int size1, int size2,
                                __global float4* trans)
{
    int gid = get_global_id(0);
    float2 temp_dot;
    float min_dot = FLT_MAX;
    int ind = 0;
    for (int i = 0; i < size2; i++)
    {
        // Note: without a (float2) cast, the parentheses below are the C comma
        // operator, so only the second expression is assigned (splatted).
        temp_dot = (dot2[i].x - dot1[permut1[gid]].x, dot2[i].y - dot1[permut1[gid]].y);
        if ((temp_dot.x*temp_dot.x + temp_dot.y*temp_dot.y) < min_dot)
        {
            min_dot = temp_dot.x*temp_dot.x + temp_dot.y*temp_dot.y;
            ind = i;
        }
    }
    float4 num_pl2 = combo2[ind];
    trans[gid] = convert_float4_rtp(num_pl2);
}
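If the intent on the temp_dot line was a component-wise difference, a sketch of the two equivalent ways to write it (an OpenCL vector literal needs the explicit cast):

/* Vector literal with the (float2) cast... */
temp_dot = (float2)(dot2[i].x - dot1[permut1[gid]].x,
                    dot2[i].y - dot1[permut1[gid]].y);

/* ...or simply component-wise vector subtraction: */
temp_dot = dot2[i] - dot1[permut1[gid]];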

Why does increasing the number of kernel arguments impact performance?


GPU: 7970

OS: Kubuntu 12.04 x64

Driver: Catalyst 13.4

 

Hi,

 

I know this sounds crazy, but it is actually true: increasing the number of kernel arguments beyond a certain count causes performance to drop. In one of my kernels, moving from 6 to 7 kernel arguments drops performance by 30%, even though I don't do any computation with the new argument. I can't reduce the number of arguments by assembling them into a struct, because all of them are dynamically allocated arrays of variable length. Is there a way to avoid this problem?

 

Regards,

Sayantan 

GPU programming in OpenCL


Hi guys,

 

I bought an ATI Radeon R9 290X GPU but found out it is not supported by MATLAB, so I must either change it to a GeForce or learn how to use it with OpenCL.

 

Could anyone show me an example of how to program in OpenCL to make use of the GPU's computing power? My program is in MATLAB, and it is all computation, with no graphics.
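For anyone else starting out, the smallest complete OpenCL host program has roughly this shape; a sketch of the canonical vector add (most error checking omitted):

#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Compile the kernel source at runtime. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Device buffers, initialized from host arrays. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    size_t gsz = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    printf("c[10] = %f\n", c[10]);  /* expect 30.0 */
    return 0;
}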


Why only 256 work-items per workgroup on ATI GPUs?


For CPU devices, AMD supports 1024 work-items per workgroup.

NV has supported 1024 for GPUs from the very beginning...

AMD has the same or a larger amount of shared memory and the same or a larger register file, so why this limitation?

Why only 4 waves per workgroup? If one needs to share the whole LDS, one is limited to only 4 wavefronts per CU no matter how many registers remain. But more waves in flight would result in better CU usage and latency hiding. So this 256 limitation looks quite artificial and not good for performance in some cases.

 

Hence the question: why was this particular size of 256 chosen? Are the reasons still relevant for new GPUs, or could they have a bigger workgroup size?
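For reference, the limit is queryable per device and per kernel; the kernel-specific value can be lower than the device maximum when register use is high (a sketch; device and kernel are assumed to be already created):

size_t dev_max = 0, krn_max = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof dev_max, &dev_max, NULL);           /* 256 on many AMD GPUs */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof krn_max, &krn_max, NULL);  /* per-kernel limit */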

Bug in clEnqueueTask?


I was using CodeXL to do some profiling and it kept saying:

 

Opencl Memory leak detected [Ref =1] Object created by clEnqueueNDRangeKernel

 

Eventually it was narrowed down to this line:

 

status = clEnqueueTask( myQueue, myKernel, 0, NULL, NULL) ;

 

After changing it to:

 

status = clEnqueueTask( myQueue, myKernel, 0, NULL, &clEvent ) ;
clReleaseEvent ( clEvent ) ;

 

The memory leak went away.

 

As per the spec, when passing NULL no event object should be created, but one clearly was in this case.

 

I tried it on a 6950 and a 280X, both using the 13.12 Catalyst driver, and they both have this issue.

 

Edit:

 

clEnqueueNDRangeKernel( myQueue, myKernel, workDims, global_work_offset,(const size_t *)global_work_size, local_work_size, 0, NULL, NULL) ;

 

If I use this instead of clEnqueueTask, it also runs into the same memory-leak problem, which can be solved the same way.

How does the GPU handle its code?


I mean: how is the kernel binary passed to the GPU (at context creation, before kernel launch, and so forth)?

Where is it stored? (A CPU uses ordinary system memory for code storage, but has a separate L1 instruction cache to speed up instruction fetching. What about a GPU? Is the kernel binary stored in GPU global memory or in a special limited-size buffer? How are instructions fetched? Do they go through the common data cache, through the constant cache, or through something special?)

How big can the performance impact of code bloat be? (If one uses a few different specialized kernels for similar work instead of one slower/more complex universal kernel, how will the increase in the total number of kernels and in total binary size affect performance?)

 

There is information about how the GPU handles data memory, but not much, if any, about how it handles code. Worth discussing?
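As a small data point, the size of the compiled binary is at least observable from the host (a sketch for a single-device program; program is assumed already built):

size_t bin_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof bin_size, &bin_size, NULL);
printf("kernel binary: %lu bytes\n", (unsigned long)bin_size);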

clBuildProgram crash


This kernel runs fine on my computer on a CPU device, and the CodeXL Kernel Analyzer verifies it should work on a wide variety of cards: everything HD5k or newer, for image support. Unfortunately I'm currently stuck with an HD4870 and therefore cannot test it myself. On my roommate's computer, however, clBuildProgram reports an unexpected exception, both in my own host application and in the Kernel Analyzer. We installed new drivers, the APP SDK, and CodeXL, but nothing seems to work. Is this a driver bug, or is the code actually somehow wrong?

 


constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

typedef struct {
    float4 orig;
    float4 dir;
} CLRay;

typedef struct {
    float2 uv;
    float t;
    int id;
} CLResult;

typedef struct {
    float4 a1, a2, a3;
} CLTriangle;

typedef struct {
    float4 nor, col;
} CLTriangledata;

// Note: img is both read and written below; in OpenCL 1.x an image argument
// must be read_only (the default) or write_only, not both.
kernel void fill_image(image2d_t img,
                       constant CLRay * ray,
                       constant CLResult * result,
                       global CLTriangledata * triangledata,
                       float N)
{
    const int2 id = (int2){get_global_id(0), get_global_id(1)};
    const int id1d = id.y*get_global_size(0) + id.x;
    int res = result[id1d].id;
    float4 col = (float4).0f;
    if (res >= 0)
        col = pow(triangledata[res].col, 2.2f) * fabs(dot(ray[id1d].dir, triangledata[res].nor));
    float4 oldcol = read_imagef(img, sampler, id);
    write_imagef(img, id, (oldcol*(N-1.0f)+col)/N);
}

Possible bug in read_imagef


Strange behavior: the kernel works fine with this sampler:

__constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;

using read_imagef(image, imageSampler, coords);

but!..

without a sampler, using "read_imagef(image, coords)" (which should work the same way as the above), the data comes back randomly corrupted.

The image is a 3D image, referenced from OpenGL, rendered there slice by slice through an FBO.

Seems like a bug?

P.S. Another, regular 2D image works fine with or without the sampler.
