Understanding CUDA grid dimensions block dimensions and threads organization simple explanation

Question

How are threads organized to be executed by a GPU

User · Accepted Answer

Hardware  If a GPU device has  for example  4 multiprocessing units  and they can run  768 threads each  then at a given moment no more than 4 768 threads will be really running in parallel  if you planned more threads  they will be waiting their turn    Software  threads are organized in blocks  A block is executed by a multiprocessing unit  The threads of a block can be indentified  indexed  using 1Dimension x   2Dimensions  x y  or 3Dim indexes  x y z  but in any case xyz  lt   768 for our example  other restrictions apply to x y z  see the guide and your device capability    Obviously  if you need more than those 4 768 threads you need more than 4 blocks  Blocks may be also indexed 1D  2D or 3D  There is a queue of blocks waiting to enter the GPU  because  in our example  the GPU has 4 multiprocessors and only 4 blocks are being executed simultaneously    Now a simple case  processing a 512x512 image  Suppose we want one thread to process one pixel  i j    We can use blocks of 64 threads each  Then we need 512 512 64   4096 blocks  so to have 512x512 threads   4096 64   It s common to organize  to make indexing the image easier  the threads in 2D blocks having blockDim   8 x 8  the 64 threads per block    I prefer to call it threadsPerBlock   dim3 threadsPerBlock 8  8       64 threads   and 2D gridDim   64 x 64 blocks  the 4096 blocks needed   I prefer to call it numBlocks   dim3 numBlocks imageWidth threadsPerBlock x      for instance 512 8   64                 imageHeight threadsPerBlock y      The kernel is launched like this   myKernel  lt  lt  lt numBlocks threadsPerBlock gt  gt  gt      params for the kernel function                Finally  there will be something like  a queue of 4096 blocks   where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed   In the kernel the pixel  i j  to be processed by a thread is calculated this way   uint i    blockIdx x   blockDim x    threadIdx x  uint j    blockIdx y   blockDim y    threadIdx y

User · Answer

Suppose a 9800GT GPU    it has 14 multiprocessors  SM  each SM has 8 thread-processors  AKA stream-processors  SP or cores  allows up to 512 threads per block warpsize is 32  which means each of the 14x8 112 thread-processors can schedule up to 32 threads    https   www tutorialspoint com cuda cuda threads htm  A block cannot have more active threads than 512 therefore   syncthreads can only synchronize limited number of threads  i e  If you execute the following with 600 threads   func1      syncthreads    func2      syncthreads      then the kernel must run twice and the order of execution will be    func1 is executed for the first 512 threads func2 is executed for the first 512 threads func1 is executed for the remaining threads func2 is executed for the remaining threads   Note   The main point is   syncthreads is a block-wide operation and it does not synchronize all threads     I m not sure about the exact number of threads that   syncthreads can synchronize  since you can create a block with more than 512 threads and let the warp handle the scheduling  To my understanding it s more accurate to say  func1 is executed at least for the first 512 threads   Before I edited this answer  back in 2010  I measured 14x8x32 threads were synchronized using   syncthreads   I would greatly appreciate if someone test this again for a more accurate piece of information

[cuda] Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

Hardware

Software

Now a simple case: processing a 512x512 image

Examples related to cuda

Examples related to nvidia