[cuda] Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

Suppose a 9800GT GPU:

  • it has 14 multiprocessors (SM)
  • each SM has 8 thread-processors (AKA stream-processors, SP or cores)
  • allows up to 512 threads per block
  • warpsize is 32 (which means each of the 14x8=112 thread-processors can schedule up to 32 threads)

https://www.tutorialspoint.com/cuda/cuda_threads.htm

A block cannot have more active threads than 512 therefore __syncthreads can only synchronize limited number of threads. i.e. If you execute the following with 600 threads:

func1();
__syncthreads();
func2();
__syncthreads();

then the kernel must run twice and the order of execution will be:

  1. func1 is executed for the first 512 threads
  2. func2 is executed for the first 512 threads
  3. func1 is executed for the remaining threads
  4. func2 is executed for the remaining threads

Note:

The main point is __syncthreads is a block-wide operation and it does not synchronize all threads.


I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp handle the scheduling. To my understanding it's more accurate to say: func1 is executed at least for the first 512 threads.

Before I edited this answer (back in 2010) I measured 14x8x32 threads were synchronized using __syncthreads.

I would greatly appreciate if someone test this again for a more accurate piece of information.