Suppose a 9800GT GPU:
https://www.tutorialspoint.com/cuda/cuda_threads.htm
A block cannot have more than 512 active threads, so __syncthreads can only synchronize a limited number of threads. For example, if you execute the following with 600 threads:
func1();
__syncthreads();
func2();
__syncthreads();
then the kernel must run twice and the order of execution will be:
Note: The main point is that __syncthreads is a block-wide operation; it does not synchronize all threads.
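To illustrate the block-wide scope, here is a minimal sketch. The kernel name reverseWithinBlock and the 2 x 300 launch configuration are my own illustrative choices, not part of the original measurement. It assumes 600 threads split into two blocks of 300, so each block fits under the 512-thread limit. Every thread writes one slot of shared memory (playing the role of func1), waits at the barrier, then reads a slot written by a different thread of the same block (playing the role of func2).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block reverses its own slice of `data`.
__global__ void reverseWithinBlock(int *data)
{
    extern __shared__ int tile[];                 // one int per thread in this block
    int local  = threadIdx.x;
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = data[global];                   // "func1": each thread writes its slot
    __syncthreads();                              // barrier for the threads of THIS block only
    data[global] = tile[blockDim.x - 1 - local];  // "func2": read a slot written by another thread
}

int main()
{
    const int threadsPerBlock = 300;              // assumed value, below the 512 limit
    const int numBlocks = 2;                      // 2 * 300 = 600 threads in total
    const int n = threadsPerBlock * numBlocks;

    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reverseWithinBlock<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(dev);

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // If the launch succeeds, each block is reversed independently,
    // so this prints: 299 0 599 300
    printf("%d %d %d %d\n", host[0], host[299], host[300], host[599]);
    return 0;
}
```

The two blocks never wait for each other: the barrier only guarantees that all 300 threads of one block have written their slot before any of them reads.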
I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp handle the scheduling. To my understanding, it's more accurate to say: func1 is executed at least for the first 512 threads.
Before I edited this answer (back in 2010), I measured that 14x8x32 threads were synchronized using __syncthreads.
I would greatly appreciate it if someone could test this again to get more accurate information.
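A possible starting point for such a test is to ask the runtime what the device reports instead of relying on remembered figures. This is only a sketch, assuming device 0 is the GPU of interest; emptyKernel is a placeholder name of mine. It prints the per-block thread limit, warp size, and multiprocessor count, then attempts a launch with one thread more than the reported limit to see how the runtime reacts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: only used to probe whether a launch configuration is accepted.
__global__ void emptyKernel() {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // assumes device 0

    printf("%s: maxThreadsPerBlock = %d, warpSize = %d, multiprocessors = %d\n",
           prop.name, prop.maxThreadsPerBlock, prop.warpSize, prop.multiProcessorCount);

    // Try a block one thread larger than the reported limit.
    int tooMany = prop.maxThreadsPerBlock + 1;
    emptyKernel<<<1, tooMany>>>();

    // Launch-configuration errors are reported through cudaGetLastError().
    cudaError_t err = cudaGetLastError();
    printf("launch with %d threads per block: %s\n", tooMany, cudaGetErrorString(err));
    return 0;
}
```

On the devices I am aware of, the oversized launch is rejected with an invalid configuration error rather than being split up, which would mean a single block never exceeds the reported limit. I have not rerun this on a 9800GT, so treat it as something to verify.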