OPENCL
Advanced Features
Giuseppe Tagliavini
[email protected]
Christian Pinto
[email protected]
Luca Benini
[email protected]
OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA
PROFILING (1/3)
Enable profiling on the device command queue:
cl_command_queue queue;                     // command queue
cl_int error;                               // error handling
queue = clCreateCommandQueue(context,       // context
            device,
            CL_QUEUE_PROFILING_ENABLE,      // enable profiling
            &error);
PROFILING (2/3)
cl_int error;
cl_ulong param_value;
error = clGetEventProfilingInfo(event, param_name,
            sizeof(cl_ulong), &param_value, NULL);

event                  the event object being queried
param_name             one of:
                       CL_PROFILING_COMMAND_QUEUED
                       CL_PROFILING_COMMAND_SUBMIT
                       CL_PROFILING_COMMAND_START
                       CL_PROFILING_COMMAND_END
param_value_size       sizeof(cl_ulong)
param_value            a pointer to memory where the result being queried
                       is returned (device time counter, in nanoseconds)
param_value_size_ret   NULL (not needed here)
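As a usage example, the following sketch measures the execution time of a kernel through its event. It assumes queue was created with CL_QUEUE_PROFILING_ENABLE and that kernel and global_size already exist; variable names are illustrative.

#include <stdio.h>
#include <CL/cl.h>

/* ... context, queue (with profiling enabled), kernel set up elsewhere ... */
cl_event evt;
cl_ulong t_start, t_end;
cl_int err;

err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
          &global_size, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);                 /* wait for the command to finish */

clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
          sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
          sizeof(cl_ulong), &t_end, NULL);

printf("Kernel execution time: %lu ns\n", (unsigned long)(t_end - t_start));
clReleaseEvent(evt);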
PROFILING (3/3)
On the host side, we can read the system clock:
#include <time.h>

struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// Host code...
clock_gettime(CLOCK_MONOTONIC, &end);
unsigned long delta =
    (end.tv_sec * 1000000000UL + end.tv_nsec) -
    (start.tv_sec * 1000000000UL + start.tv_nsec);   // elapsed time in ns
PROFILING ON NVIDIA PLATFORMS
export OPENCL_PROFILE=1
./example
cat opencl_profile_0.log
# OPENCL_PROFILE_LOG_VERSION 2.0
# OPENCL_DEVICE 0 Tesla C2070
# OPENCL_CONTEXT 1
# TIMESTAMPFACTOR fffff6bbdff06bd0
method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 1.472 ] cputime=[ 7.000 ]
method=[ memcpyHtoDasync ] gputime=[ 1.280 ] cputime=[ 4.000 ]
method=[ add ] gputime=[ 4.096 ] cputime=[ 11.000 ] occupancy=[ 0.333 ]
method=[ memcpyDtoHasync ] gputime=[ 3.712 ] cputime=[ 25.000 ]
OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA
PINNED MEMORY
• Pinned memory is a region of the host memory space which is not pageable
• Swapping is disabled for this region
• It enables faster host ↔ device transfer times (DMA)

cl_mem buffer;
buffer = clCreateBuffer(context, CL_MEM_READ_WRITE |
            CL_MEM_ALLOC_HOST_PTR, szBufferBytes, NULL, &error);

How can we access this memory area?
MEMORY MAPPING
• clEnqueueMapBuffer enqueues a command to map a buffer object into the
  host address space, and returns a pointer to this mapped region

unsigned char * hData;
hData = (unsigned char *) clEnqueueMapBuffer(queue, buffer,
            CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
            0, nbytes,                    // offset, number of bytes
            0, NULL, NULL, &error);

• To unmap the memory region (making it safe for use by the device):

clEnqueueUnmapMemObject(queue, buffer, hData,
            0, NULL, NULL);
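Putting the last two slides together, a possible host-side flow (a sketch with error checking omitted; context, queue and nbytes are assumed to exist already):

cl_int error;

/* 1. allocate a pinned (non-pageable) buffer */
cl_mem buffer = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
        nbytes, NULL, &error);

/* 2. map it into the host address space and fill it */
unsigned char *hData = (unsigned char *) clEnqueueMapBuffer(queue, buffer,
        CL_TRUE, CL_MAP_WRITE, 0, nbytes, 0, NULL, NULL, &error);
for (size_t i = 0; i < nbytes; i++)
    hData[i] = (unsigned char) i;

/* 3. unmap it so the device can safely use it (e.g. as a kernel argument) */
clEnqueueUnmapMemObject(queue, buffer, hData, 0, NULL, NULL);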
OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA
CASE STUDY: MATRIX MULTIPLICATION
for (int i = 0; i < heightA; i++)
{
    for (int j = 0; j < widthB; j++)
    {
        C[i][j] = 0;
        for (int k = 0; k < widthA; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
}

Using single-dimensional arrays:
A[i][k] → A[i*widthA + k]
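For reference, the same loop nest written in plain C over flattened single-dimensional arrays (the layout used by the kernels below); a sketch of the sequential baseline:

/* Sequential matrix multiplication on flattened arrays:
 * A is heightA x widthA, B is widthA x widthB, C is heightA x widthB. */
void matmul_ref(const float *A, const float *B, float *C,
                int heightA, int widthA, int widthB)
{
    for (int i = 0; i < heightA; i++) {
        for (int j = 0; j < widthB; j++) {
            float sum = 0.0f;
            for (int k = 0; k < widthA; k++)
                sum += A[i*widthA + k] * B[k*widthB + j];
            C[i*widthB + j] = sum;
        }
    }
}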
MATRIX MULTIPLICATION: KERNEL V1
kernel void simpleMultiply(global float* outputC,
                           int widthA, int heightA,
                           int widthB, int heightB,
                           global float* inputA,
                           global float* inputB)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < widthA; i++)
        sum += inputA[row*widthA + i] * inputB[i*widthB + col];
    outputC[row*widthB + col] = sum;
}

NDRange size = [widthB, heightA]
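A possible host-side launch for this kernel (a sketch; the buffer objects bufA, bufB, bufC, the kernel object and the queue are assumed to be created elsewhere):

cl_int err;
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufC);
err |= clSetKernelArg(kernel, 1, sizeof(int),    &widthA);
err |= clSetKernelArg(kernel, 2, sizeof(int),    &heightA);
err |= clSetKernelArg(kernel, 3, sizeof(int),    &widthB);
err |= clSetKernelArg(kernel, 4, sizeof(int),    &heightB);
err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &bufA);
err |= clSetKernelArg(kernel, 6, sizeof(cl_mem), &bufB);

size_t global[2] = { (size_t)widthB, (size_t)heightA };   /* [x, y] */
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
          global, NULL,        /* local work size left to the runtime */
          0, NULL, NULL);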
OPENCL MEMORY HIERARCHY
[Figure: OpenCL memory hierarchy, from private memory (faster, smaller) down to global memory (slower, greater capacity)]
MATRIX MULTIPLICATION: KERNEL V2
kernel void coalescedMultiply(global float* outputC,
                              int widthA, int heightA,
                              int widthB, int heightB,
                              global float* inputA,
                              global float* inputB)
{
    // note: for a local array, widthA must be a compile-time constant
    // (e.g. a macro passed with -D), just like BLOCK_SIZE
    local float aTile[BLOCK_SIZE][widthA];
    int row = get_global_id(1);
    int col = get_global_id(0);
    int x = get_local_id(0);
    int y = get_local_id(1);
    float sum = 0.0f;
    aTile[y][x] = inputA[row*widthA + x];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < widthA; i++)
        sum += aTile[y][i] * inputB[i*widthB + col];
    outputC[row*widthB + col] = sum;
}
MATRIX MULTIPLICATION: KERNEL V2
Same kernel as above; the key points:
1. Local memory access is FASTER
2. Transfers to adjacent memory addresses are COALESCED
3. Work-group size: (widthA, BLOCK_SIZE)
4. Synchronization using a local barrier
MATRIX MULTIPLICATION: KERNEL V3
kernel void coalescedMultiply(global float* outputC,
                              int widthA, int heightA,
                              int widthB, int heightB,
                              global float* inputA,
                              global float* inputB)
{
    local float aTile[BLOCK_SIZE][BLOCK_SIZE];
    ...
    for (int m = 0; m < widthA/BLOCK_SIZE; m++)
    {
        aTile[y][x] = inputA[row*widthA + m*BLOCK_SIZE + x];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int i = 0; i < BLOCK_SIZE; i++)
            sum += aTile[y][i] * inputB[(m*BLOCK_SIZE + i)*widthB + col];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    ...
MATRIX MULTIPLICATION: KERNEL V3
Same kernel as above; the key points:
• Local memory usage is limited!!!
• Work-group size: (BLOCK_SIZE, BLOCK_SIZE)
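Since local memory is limited, the host can check the available amount before committing to a BLOCK_SIZE; a sketch, assuming device is the cl_device_id in use and BLOCK_SIZE is the tile dimension defined for the kernel:

cl_ulong local_mem_size;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(cl_ulong), &local_mem_size, NULL);

/* one BLOCK_SIZE x BLOCK_SIZE tile of floats must fit in local memory */
size_t tile_bytes = (size_t)BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
if (tile_bytes > local_mem_size) {
    /* choose a smaller BLOCK_SIZE or fall back to kernel V1 */
}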
OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA
P2012 “COMPUTING” AS OPENCL DEVICE
[Figure: the OpenCL conceptual architecture (compute units with private memories, local memory, constant data cache, DMA, global memory) mapped onto the P2012 architecture (clusters of M processors with a 256 KB shared L1 each, per-cluster DMA engines, external L3 memory)]

OpenCL conceptual architecture:
• Scalable programming model
• Supports SPMD model
• Supports task parallelism
• Covers complex memory hierarchies
• Supports async memory transfers

P2012 architecture:
• Scalable architecture (cluster based)
• Supports SPMD with 0-cost branch divergence
• Supports task parallelism
• Shared local memory
• 1D/2D DMA engines
• Hardware synchronizer
P2012 & OPENCL: KERNEL LEVEL PARALLELISM
With P2012 it is possible to implement more complex OpenCL task graphs
than on GPUs: both task-level and data-level (ND-Range) parallelism are
supported.

The P2012 OpenCL runtime does not accept more than 16 work-items per
work-group when creating an ND-Range. This is because having more
work-items than PEs would result in many context switches, which are
very expensive on this architecture (see the host-side sketch below).
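In practice this means the host must cap the local work size; a minimal sketch (the 16-work-item limit is from the slide above, variable names are illustrative):

#define P2012_MAX_WG_ITEMS 16              /* limit from the P2012 OpenCL runtime */

size_t local_size  = P2012_MAX_WG_ITEMS;
size_t global_size = n_elements;

/* round the global size up to a multiple of the local size */
if (global_size % local_size != 0)
    global_size += local_size - (global_size % local_size);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);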
P2012 & GPUS: DIFFERENCES
• P2012 cores are single-threaded, while GPU cores are highly multithreaded
  • In GPU programming, memory latencies are hidden by running a massive
    battery of threads; GPUs have negligible task scheduling overhead.
  • In P2012, memory latencies are hidden by asynchronous DMA copies,
    because context switches are expensive.
• GPU cores execute in lock-step (SIMT fashion)
  • All threads in a warp execute the same instruction flow; diverging
    threads cause an important performance loss.
  • P2012 cluster PEs can execute different instruction flows without
    affecting application performance.
P2012 & GPUS: PROGRAMMING STYLE DIFFERENCES
[Figure: the same data set decomposed into work-groups of 16 work-items on P2012 versus a GPU-style decomposition]

P2012 (2 clusters, 16 PEs per cluster):
• ND-Range: 2 work-groups, 16 work-items per work-group

GPU (several clusters, 32 PEs per cluster):
• ND-Range: 1 work-item per data element
• the number of work-groups depends on the work-group size
OUTLINE
• Profiling
• Pinned memory
• Local buffering
• Differences between P2012 and GPUs
• Smart memory management: DMA
P2012&OPENCL: SMART MEMORY USAGE
Problem:
• External L3 memory accesses have a high cost (hundreds of cycles)
• P2012 cannot hide memory latencies with thread scheduling

Solution:
• Use local memory as a user-managed cache
• Hide memory latencies by overlapping computation with memory transfers
• DMA asynchronous transfers: global ↔ local
P2012 & OPENCL: OVERLAP DMA TRANSFERS AND COMPUTATION
The best way to hide memory transfer latencies when programming for P2012
is to overlap computation with DMA transfers. This technique is based on
software pipelining and double buffering (which reduces the total amount
of available local or private memory). Four buffers are needed to
implement this mechanism.
P2012 & OPENCL: DMA PRIMITIVES
DMA transfer primitives:

- Per work-item memory transfer:
  async_work_item_copy(void *src, void *dst, size_t bytes, event_t event);

- Per work-group memory transfer:
  async_work_group_copy(void *src, void *dst, size_t bytes, event_t event);
  Copy the data for an entire work-group with a single DMA transfer

Wait for a DMA transfer to finish:
  wait_events(int num_events, event_t *events);
  wait_group_events(int num_events, event_t *events);
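For comparison, standard OpenCL C offers a similar built-in, async_work_group_copy, with a slightly different signature (destination first, element count instead of bytes). A small kernel sketch using the standard flavour (names and the fixed tile size are illustrative assumptions):

kernel void copy_tile(global const float *in, global float *out, int tile_size)
{
    local float tile[256];                 /* assumes tile_size <= 256 */

    /* collective copy: one tile from global to local memory */
    event_t evt = async_work_group_copy(tile,
                      in + get_group_id(0) * tile_size, tile_size, 0);
    wait_group_events(1, &evt);            /* wait for the transfer to finish */

    int lid = (int)get_local_id(0);
    if (lid < tile_size)
        out[get_group_id(0) * tile_size + lid] = tile[lid] * 2.0f;
}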
P2012 & OPENCL: OVERLAP DMA TRANSFERS AND COMPUTATION
Pipeline stages, for block N:
• Write block N-1: local → global
• Process block N: local → local
• Read block N+1: global → local

- Assume each work-item works on multiple rows
- Each work-item moves data in rows using asynchronous DMA transfers
- Each cluster has 2 DMAs to allow parallel transfers in the two
  directions (local → global, global → local)
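A condensed OpenCL C sketch of the resulting software pipeline. It is simplified: it double-buffers only the input side (the full scheme on the slide uses four buffers to also overlap the write-back), it uses the standard async_work_group_copy built-in instead of the P2012-specific primitives, and the processing step is a placeholder:

kernel void pipelined(global const float *in, global float *out,
                      int n_blocks, int block_elems)
{
    local float buf[2][256];               /* assumes block_elems <= 256 */
    int cur = 0;

    /* prime the pipeline: start fetching block 0 */
    event_t ev_in = async_work_group_copy(buf[cur], in, block_elems, 0);

    for (int n = 0; n < n_blocks; n++) {
        wait_group_events(1, &ev_in);      /* block n is now in local memory */

        /* prefetch block n+1 into the other buffer while block n is processed */
        if (n + 1 < n_blocks)
            ev_in = async_work_group_copy(buf[1 - cur],
                        in + (n + 1) * block_elems, block_elems, 0);

        /* process block n (placeholder computation) */
        int lid  = (int)get_local_id(0);
        int step = (int)get_local_size(0);
        for (int i = lid; i < block_elems; i += step)
            buf[cur][i] *= 2.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* write block n back to global memory */
        event_t ev_out = async_work_group_copy(out + n * block_elems,
                                               buf[cur], block_elems, 0);
        wait_group_events(1, &ev_out);

        cur = 1 - cur;                     /* swap input buffers */
    }
}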