OPENCL Advanced Features

Giuseppe Tagliavini [email protected]
Christian Pinto [email protected]
Luca Benini [email protected]

OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA

PROFILING (1/3)

Enable profiling on the device command queue:

cl_command_queue queue;
cl_int error;
queue = clCreateCommandQueue(context,                   /* context          */
                             device,                    /* device           */
                             CL_QUEUE_PROFILING_ENABLE, /* enable profiling */
                             &error);                   /* error handling   */

PROFILING (2/3)

cl_int error;
cl_ulong param_value;
error = clGetEventProfilingInfo(event, param_name, param_value_size,
                                &param_value, &param_value_size_ret);

• event: the event object being queried
• param_name: one of
    CL_PROFILING_COMMAND_QUEUED
    CL_PROFILING_COMMAND_SUBMIT
    CL_PROFILING_COMMAND_START
    CL_PROFILING_COMMAND_END
• param_value_size: sizeof(cl_ulong)
• param_value: a pointer to memory where the queried value is returned (device time counter, in nanoseconds)
• param_value_size_ret: NULL

PROFILING (3/3)

On the host side, we can read the system clock:

#include <time.h>

struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// Host code...
clock_gettime(CLOCK_MONOTONIC, &end);
/* convert seconds to nanoseconds before adding tv_nsec */
unsigned long delta = (end.tv_sec * 1000000000UL + end.tv_nsec)
                    - (start.tv_sec * 1000000000UL + start.tv_nsec);

PROFILING ON NVIDIA PLATFORMS

export OPENCL_PROFILE=1
./example
cat opencl_profile_0.log

# OPENCL_PROFILE_LOG_VERSION 2.0
# OPENCL_DEVICE 0 Tesla C2070
# OPENCL_CONTEXT 1
# TIMESTAMPFACTOR fffff6bbdff06bd0
method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 1.472 ] cputime=[ 7.000 ]
method=[ memcpyHtoDasync ] gputime=[ 1.280 ] cputime=[ 4.000 ]
method=[ add ] gputime=[ 4.096 ] cputime=[ 11.000 ] occupancy=[ 0.333 ]
method=[ memcpyDtoHasync ] gputime=[ 3.712 ] cputime=[ 25.000 ]
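Putting the pieces together, here is a minimal sketch of event-based kernel timing. It assumes a queue created with CL_QUEUE_PROFILING_ENABLE as above; the kernel object and the NDRange size are hypothetical placeholders.

cl_event event;
cl_ulong t_start, t_end;
size_t global_size = 1024;  /* hypothetical NDRange size */

/* Enqueue the (hypothetical) kernel and keep its event for profiling */
error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, &event);
clWaitForEvents(1, &event);  /* block until the kernel has completed */

/* Read the device timestamps, in nanoseconds */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);
printf("Kernel execution time: %lu ns\n", (unsigned long)(t_end - t_start));
clReleaseEvent(event);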
OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA

PINNED MEMORY

Pinned memory is a region in the host memory space which is not pageable (swapping is disabled). It enables faster host-device transfer times, since the device can access it directly through DMA.

cl_mem buffer;
buffer = clCreateBuffer(context,
                        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                        szBufferBytes, NULL, &error);

How can we access this memory area?

MEMORY MAPPING

clEnqueueMapBuffer enqueues a command to map a buffer object into the host address space, and returns a pointer to this mapped region:

unsigned char *hData;
hData = (unsigned char *) clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                             CL_MAP_READ | CL_MAP_WRITE,
                                             0, nbytes, /* offset, number of bytes */
                                             0, NULL, NULL, &error);

To unmap the memory region (making it safe for use by the device again):

clEnqueueUnmapMemObject(queue, buffer, hData, 0, NULL, NULL);
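A minimal end-to-end sketch combining the two steps above: allocate a pinned buffer, map it, fill it from the host, and unmap it before the device uses it. The context, queue and error variables are assumed to be set up elsewhere.

/* Pinned allocation + map/unmap (assumes context, queue, error exist) */
size_t nbytes = 1024 * sizeof(float);
cl_mem buffer = clCreateBuffer(context,
                               CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               nbytes, NULL, &error);

/* Blocking map: returns a host pointer backed by pinned memory */
float *hData = (float *) clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                            CL_MAP_WRITE, 0, nbytes,
                                            0, NULL, NULL, &error);
for (size_t i = 0; i < 1024; i++)
    hData[i] = (float) i;  /* initialize the input data on the host */

/* Unmap: hands the region back to the device; hData becomes invalid */
clEnqueueUnmapMemObject(queue, buffer, hData, 0, NULL, NULL);
/* ... enqueue kernels that use `buffer` ... */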
OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA

CASE STUDY: MATRIX MULTIPLICATION

for (int i = 0; i < heightA; i++) {
    for (int j = 0; j < widthB; j++) {
        C[i][j] = 0;
        for (int k = 0; k < widthA; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
}

Using single-dimensional arrays: A[i][k] becomes A[i*widthA + k].

MATRIX MULTIPLICATION: KERNEL V1

// NDRange size = [widthB, heightA]
kernel void simpleMultiply(global float* outputC,
                           int widthA, int heightA,
                           int widthB, int heightB,
                           global float* inputA,
                           global float* inputB)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < widthA; i++)
        sum += inputA[row*widthA+i] * inputB[i*widthB+col];
    outputC[row*widthB+col] = sum;
}

OPENCL MEMORY HIERARCHY

[Figure: the OpenCL memory hierarchy, from private memory (faster, smaller) through local memory down to global memory (slower, greater capacity).]

MATRIX MULTIPLICATION: KERNEL V2

kernel void coalescedMultiply(global float* outputC,
                              int widthA, int heightA,
                              int widthB, int heightB,
                              global float* inputA,
                              global float* inputB)
{
    /* OpenCL C requires compile-time array sizes: BLOCK_SIZE and WIDTH_A
       (the value of widthA) must be defined at build time, e.g. via -D */
    local float aTile[BLOCK_SIZE][WIDTH_A];
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = inputA[row*widthA+x];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < widthA; i++)
        sum += aTile[y][i] * inputB[i*widthB+col];
    outputC[row*widthB+col] = sum;
}

1. Local memory access is FASTER
2. Transfers to adjacent memory addresses are COALESCED
3. Work-group size: (widthA, BLOCK_SIZE)
4. Synchronization using a local barrier

MATRIX MULTIPLICATION: KERNEL V3

kernel void coalescedMultiply(global float* outputC,
                              int widthA, int heightA,
                              int widthB, int heightB,
                              global float* inputA,
                              global float* inputB)
{
    local float aTile[BLOCK_SIZE][BLOCK_SIZE];
    …
    for (int m = 0; m < widthA/BLOCK_SIZE; m++) {
        aTile[y][x] = inputA[row*widthA + m*BLOCK_SIZE + x];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int i = 0; i < BLOCK_SIZE; i++)
            sum += aTile[y][i] * inputB[(m*BLOCK_SIZE + i)*widthB + col];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    …
}

Local memory usage is limited! The fixed BLOCK_SIZE × BLOCK_SIZE tile bounds local memory consumption regardless of widthA.
Work-group size: (BLOCK_SIZE, BLOCK_SIZE)
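On the host side, the tiled kernel must be compiled with BLOCK_SIZE defined and launched with a matching work-group size. A minimal sketch, assuming the program, kernel, device, queue and matrix dimensions are set up elsewhere:

#define BLOCK_SIZE 16  /* assumed to divide widthA, widthB and heightA */

/* Make BLOCK_SIZE visible inside the kernel source at build time */
error = clBuildProgram(program, 1, &device, "-DBLOCK_SIZE=16", NULL, NULL);

/* One work-item per element of C, BLOCK_SIZE x BLOCK_SIZE work-groups */
size_t global_size[2] = {widthB, heightA};
size_t local_size[2]  = {BLOCK_SIZE, BLOCK_SIZE};
error = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                               global_size, local_size, 0, NULL, NULL);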
OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA

P2012 "COMPUTING" AS OPENCL DEVICE

[Figure: the OpenCL conceptual architecture (compute units containing processing elements with private memory, local memory, DMA and a constant data cache, above a global memory) mapped onto the P2012 architecture (clusters of processors sharing a 256 KB L1 memory each, per-cluster DMA engines, and an external L3 memory).]

OpenCL conceptual architecture:
• Scalable programming model
• Supports the SPMD model
• Supports task parallelism
• Covers a complex memory hierarchy
• Supports asynchronous memory transfers

P2012 architecture:
• Scalable architecture (cluster-based)
• Supports SPMD with zero-cost branch divergence
• Supports task parallelism
• Shared local memory
• 1D/2D DMA engines
• Hardware synchronizer

P2012 & OPENCL: KERNEL LEVEL PARALLELISM

With P2012 it is possible to implement more complex OpenCL task graphs than on GPUs. Both task-level and data-level (NDRange) parallelism are possible.

The P2012 OpenCL runtime does not accept more than 16 work-items per work-group when creating an NDRange: having more work-items than PEs would result in many context switches, which are very expensive on this architecture.

P2012 & GPUS: DIFFERENCES

P2012 cores are mono-threaded, while GPU cores are highly multi-threaded:
• In GPU programming, memory latencies are hidden by running a massive battery of threads; GPUs have negligible task scheduling overhead.
• In P2012, memory latencies are hidden by asynchronous DMA copies, because context switches are expensive.

GPU cores execute in lock-step (SIMT fashion):
• All threads in a warp execute the same instruction flow; diverging threads cause a significant performance loss.
• The PEs of a P2012 cluster can execute different instruction flows without affecting application performance.

P2012 & GPUS: PROGRAMMING STYLE DIFFERENCES

[Figure: the same data set partitioned into work-groups of 16 work-items each on P2012, versus one work-item per data element on a GPU.]

P2012 (2 clusters, 16 PEs per cluster):
• NDRange: 2 work-groups, 16 work-items per work-group

GPU (several clusters, 32 PEs per cluster):
• NDRange: one work-item per data element; the number of work-groups depends on the size of each work-group

OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA

P2012 & OPENCL: SMART MEMORY USAGE

Problem:
• Accesses to the external L3 memory have a high cost (hundreds of cycles)
• P2012 cannot hide memory latencies with thread scheduling

Solution:
• Use local memory as a user-managed cache
• Hide memory latencies by overlapping computation with memory transfers: asynchronous DMA transfers (global → local)

P2012 & OPENCL: OVERLAP DMA TRANSFERS AND COMPUTATION

The best way to hide memory transfer latencies when programming for P2012 is to overlap computation with DMA transfers. This technique is based on software pipelining and double buffering (which reduces the total amount of available local or private memory). Four buffers are needed to implement this mechanism: two for the input stream and two for the output stream.

P2012 & OPENCL: DMA PRIMITIVES

DMA transfer primitives:

• Per work-item memory transfer:
  async_work_item_copy(void *src, void *dst, size_t bytes, event_t event);

• Per work-group memory transfer (copies the data for an entire work-group with a single DMA transfer):
  async_work_group_copy(void *src, void *dst, size_t bytes, event_t event);

Wait for a DMA transfer to finish:

  wait_events(int num_events, event_t *events);
  wait_group_events(int num_events, event_t *events);

P2012 & OPENCL: OVERLAP DMA TRANSFERS AND COMPUTATION

At steady state, each pipeline iteration:
• writes block N-1 (local → global)
• processes block N (in local memory)
• reads block N+1 (global → local)

• Assume each work-item works on multiple rows
• Each work-item moves its rows using asynchronous DMA transfers
• Each cluster has 2 DMA engines, allowing parallel transfers in the two directions (local → global and global → local)
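As a concrete reference, here is a minimal sketch of staging data between global and local memory using the standard OpenCL C built-ins; note that their signatures differ slightly from the P2012 variants listed above (the standard async_work_group_copy takes an element count rather than a byte count, and returns the event). The kernel and its names are illustrative; launch it with a local size of CHUNK.

#define CHUNK 16

kernel void scale_chunk(global const float *in,
                        global float *out,
                        float factor)
{
    local float tile[CHUNK];
    int lid = get_local_id(0);
    size_t base = (size_t) get_group_id(0) * CHUNK;

    /* One async copy for the whole work-group: global -> local */
    event_t ev = async_work_group_copy(tile, in + base, CHUNK, 0);
    wait_group_events(1, &ev);

    tile[lid] *= factor;            /* compute on local memory */
    barrier(CLK_LOCAL_MEM_FENCE);   /* tile complete before the copy-back */

    /* local -> global */
    ev = async_work_group_copy(out + base, tile, CHUNK, 0);
    wait_group_events(1, &ev);
}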
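Building on the same built-ins, the sketch below implements the double-buffered pipeline described above: while block N is processed, block N+1 is prefetched into the second input buffer and the write-back of previous blocks drains in the background. Four local buffers are used (two input, two output); all names are illustrative, and nblocks is assumed to be at least 1.

#define CHUNK 16

/* Launch with local size == CHUNK; each work-group streams nblocks blocks */
kernel void pipeline(global const float *in,
                     global float *out,
                     int nblocks,
                     float factor)
{
    local float ibuf[2][CHUNK];   /* two input buffers...  */
    local float obuf[2][CHUNK];   /* ...two output buffers */
    int lid = get_local_id(0);
    size_t base = (size_t) get_group_id(0) * nblocks * CHUNK;
    event_t rd[2], wr[2];

    /* Prologue: prefetch block 0 */
    rd[0] = async_work_group_copy(ibuf[0], in + base, CHUNK, 0);

    for (int n = 0; n < nblocks; n++) {
        int cur = n & 1, nxt = 1 - cur;

        /* Prefetch block n+1 (global -> local) while block n is processed */
        if (n + 1 < nblocks)
            rd[nxt] = async_work_group_copy(ibuf[nxt],
                                            in + base + (size_t)(n + 1) * CHUNK,
                                            CHUNK, 0);

        /* obuf[cur] held block n-2: wait until its write-back has drained */
        if (n >= 2)
            wait_group_events(1, &wr[cur]);

        wait_group_events(1, &rd[cur]);             /* block n is in local memory */
        obuf[cur][lid] = ibuf[cur][lid] * factor;   /* process block n */
        barrier(CLK_LOCAL_MEM_FENCE);               /* tile complete before copy  */

        /* Write block n back (local -> global), overlapping iteration n+1 */
        wr[cur] = async_work_group_copy(out + base + (size_t) n * CHUNK,
                                        obuf[cur], CHUNK, 0);
    }

    /* Epilogue: wait for the (up to two) outstanding write-backs */
    if (nblocks >= 2)
        wait_group_events(1, &wr[(nblocks - 2) & 1]);
    wait_group_events(1, &wr[(nblocks - 1) & 1]);
}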