Marco Esposito Micenin, A/A 2010/2011

[Slide 4/21: CPU/GPU timeline diagrams. Only the figure labels survive.]

Figure: kernel launches from a CPU thread. Time axes: CPU, GPU. Labels: Serial code; kernel1<<<…>>>(); Grid 1 (thread blocks B0 to B5); Thread synchronization; Serial code; kernel2<<<…>>>(); Grid 2 (B0 to B5).

Figure: CPU/GPU Memory Transfer. Labels: Host Memory; Device Memory; Host Pinned Memory; Shared Memory CPU/GPU.

Figure: timeline with pinned memory. Time axes: CPU, GPU. Labels: Serial code; kernel1<<<…>>>(); Grid 1 (B0 to B5); Pinned Memory; Serial code.

Figure: timeline with CUQU. Time axes: CPU, GPU. Labels: kernel1<<<…>>>(); Grid 1 (B0 to B5); CUQU::push(); CUQU::fetch(); Pinned Memory; Serial code.

Figure: timeline with CUQU and barrier. Time axes: CPU, GPU. Labels: kernel1<<<…>>>(); Grid 1 (B0 to B5); CUQU::fetch(); barrier_wait(); Pinned Memory; Serial code.

Legend: Offload Time; Computation Time; Synchronization Time.

[Slide 7/21]

GPU thread synchronization

1. KSM-implicit
       for(…) { kernel<<<…>>>(); }

2. KSM-explicit
       for(…) { kernel<<<…>>>(); cudaThreadSync(); }

3. CSM-oneloop & CSM-lockfree (barrier_wait())
       __global__ void csm_kernel() {
           for(…) { compute(); barrier_wait(); }
       }

Each method is shown next to a timeline (axis: Time).
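The first two schemes on the synchronization slide can be sketched as a small host program. This is only an illustration under stated assumptions: `step_kernel`, the array size, and the iteration count are hypothetical and do not come from the slides. The sketch also assumes that the slide's `cudaThreadSync()` abbreviates `cudaThreadSynchronize()`, which is deprecated in the current runtime; `cudaDeviceSynchronize()` is used in its place. The in-kernel `barrier_wait()` of the CSM variants is not reproduced, because the slides do not show how it is implemented.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-iteration kernel (not from the slides):
// each thread increments one array element.
__global__ void step_kernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int blocks = 6;            // six blocks, like B0..B5 in the diagrams
    const int threads = 64;          // assumed block size
    const int n = blocks * threads;
    const int iters = 10;            // assumed iteration count

    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(int));

    // KSM-implicit: kernels launched into the same stream run in launch
    // order, so each launch boundary acts as a barrier for the whole grid.
    for (int k = 0; k < iters; ++k) {
        step_kernel<<<blocks, threads>>>(d, n);
    }

    // KSM-explicit: the host additionally blocks after every launch
    // until the device has finished.
    for (int k = 0; k < iters; ++k) {
        step_kernel<<<blocks, threads>>>(d, n);
        cudaDeviceSynchronize();
    }

    // A blocking copy on the default stream waits for all prior kernels.
    std::vector<int> h(n);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Every element should have been incremented 2 * iters times.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h[i] != 2 * iters) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d);
    return mismatches == 0 ? 0 : 1;
}
```

Both loops produce the same result. The difference is in timing only: in the explicit variant the CPU thread waits after each launch, whereas in the implicit variant it can queue all launches before the first one completes.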