How do I choose grid and block dimensions for CUDA kernels

Question

This is a question about how to determine the CUDA grid  block and thread sizes  This is an additional question to the one posted here   Following this link  the answer from talonmies contains a code snippet  see below   I don t understand the comment  value usually chosen by tuning and hardware constraints    I haven t found a good explanation or clarification that explains this in the CUDA documentation  In summary  my question is how to determine the optimal blocksize  number of threads  given the following code   const int n   128   1024  int blocksize   512     value usually chosen by tuning and hardware constraints int nblocks   n   nthreads     value determine by block size and total work madd lt  lt  lt nblocks blocksize gt  gt  gt mAdd A B C n

User · Answer

The answers above point out how the block size can impact performance and suggest a common heuristic for its choice based on occupancy maximization. Without wanting to provide the criterion to choose the block size, it would be worth mentioning that CUDA 6.5 (now in Release Candidate version) includes several new runtime functions to aid in occupancy calculations and launch configuration, see

CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

One of the useful functions is cudaOccupancyMaxPotentialBlockSize which heuristically calculates a block size that achieves the maximum occupancy. The values provided by that function could be then used as the starting point of a manual optimization of the launch parameters. Below is a little example.

#include <stdio.h>

/************************/
/* TEST KERNEL FUNCTION */
/************************/
__global__ void MyKernel(int *a, int *b, int *c, int N) 
{ 
    int idx = threadIdx.x + blockIdx.x * blockDim.x; 

    if (idx < N) { c[idx] = a[idx] + b[idx]; } 
} 

/********/
/* MAIN */
/********/
void main() 
{ 
    const int N = 1000000;

    int blockSize;      // The launch configurator returned block size 
    int minGridSize;    // The minimum grid size needed to achieve the maximum occupancy for a full device launch 
    int gridSize;       // The actual grid size needed, based on input size 

    int* h_vec1 = (int*) malloc(N*sizeof(int));
    int* h_vec2 = (int*) malloc(N*sizeof(int));
    int* h_vec3 = (int*) malloc(N*sizeof(int));
    int* h_vec4 = (int*) malloc(N*sizeof(int));

    int* d_vec1; cudaMalloc((void**)&d_vec1, N*sizeof(int));
    int* d_vec2; cudaMalloc((void**)&d_vec2, N*sizeof(int));
    int* d_vec3; cudaMalloc((void**)&d_vec3, N*sizeof(int));

    for (int i=0; i<N; i++) {
        h_vec1[i] = 10;
        h_vec2[i] = 20;
        h_vec4[i] = h_vec1[i] + h_vec2[i];
    }

    cudaMemcpy(d_vec1, h_vec1, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vec2, h_vec2, N*sizeof(int), cudaMemcpyHostToDevice);

    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, N); 

    // Round up according to array size 
    gridSize = (N + blockSize - 1) / blockSize; 

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Occupancy calculator elapsed time:  %3.3f ms \n", time);

    cudaEventRecord(start, 0);

    MyKernel<<<gridSize, blockSize>>>(d_vec1, d_vec2, d_vec3, N); 

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Kernel elapsed time:  %3.3f ms \n", time);

    printf("Blocksize %i\n", blockSize);

    cudaMemcpy(h_vec3, d_vec3, N*sizeof(int), cudaMemcpyDeviceToHost);

    for (int i=0; i<N; i++) {
        if (h_vec3[i] != h_vec4[i]) { printf("Error at i = %i! Host = %i; Device = %i\n", i, h_vec4[i], h_vec3[i]); return; };
    }

    printf("Test passed\n");

}

EDIT

The cudaOccupancyMaxPotentialBlockSize is defined in the cuda_runtime.h file and is defined as follows:

template<class T>
__inline__ __host__ CUDART_DEVICE cudaError_t cudaOccupancyMaxPotentialBlockSize(
    int    *minGridSize,
    int    *blockSize,
    T       func,
    size_t  dynamicSMemSize = 0,
    int     blockSizeLimit = 0)
{
    return cudaOccupancyMaxPotentialBlockSizeVariableSMem(minGridSize, blockSize, func, __cudaOccupancyB2DHelper(dynamicSMemSize), blockSizeLimit);
}

The meanings for the parameters is the following

minGridSize     = Suggested min grid size to achieve a full machine launch.
blockSize       = Suggested block size to achieve maximum occupancy.
func            = Kernel function.
dynamicSMemSize = Size of dynamically allocated shared memory. Of course, it is known at runtime before any kernel launch. The size of the statically allocated shared memory is not needed as it is inferred by the properties of func.
blockSizeLimit  = Maximum size for each block. In the case of 1D kernels, it can coincide with the number of input elements.

Note that, as of CUDA 6.5, one needs to compute one's own 2D/3D block dimensions from the 1D block size suggested by the API.

Note also that the CUDA driver API contains functionally equivalent APIs for occupancy calculation, so it is possible to use cuOccupancyMaxPotentialBlockSize in driver API code in the same way shown for the runtime API in the example above.

User · Answer

There are two parts to that answer  I wrote it   One part is easy to quantify  the other is more empirical  Hardware Constraints  This is the easy to quantify part  Appendix F of the current CUDA programming guide lists a number of hard limits which limit how many threads per block a kernel launch can have  If you exceed any of these  your kernel will never run  They can be roughly summarized as   Each block cannot have more than 512 1024 threads in total  Compute Capability 1 x or 2 x and later respectively  The maximum dimensions of each block are limited to  512 512 64   1024 1024 64   Compute 1 x 2 x or later  Each block cannot consume more than 8k 16k 32k 64k 32k 64k 32k 64k 32k 64k registers total  Compute 1 0 1 1 1 2 1 3 2 x- 3 0 3 2 3 5-5 2 5 3 6-6 1 6 2 7 0  Each block cannot consume more than 16kb 48kb 96kb of shared memory  Compute 1 x 2 x-6 2 7 0   If you stay within those limits  any kernel you can successfully compile will launch without error  Performance Tuning  This is the empirical part  The number of threads per block you choose within the hardware constraints outlined above can and does effect the performance of code running on the hardware  How each code behaves will be different and the only real way to quantify it is by careful benchmarking and profiling  But again  very roughly summarized   The number of threads per block should be a round multiple of the warp size  which is 32 on all current hardware  Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latency of the architecture and achieve maximum throughput  The orthodox approach here is to try achieving optimal hardware occupancy  what Roger Dahl s answer is referring to    The second point is a huge topic which I doubt anyone is going to try and cover it in a single StackOverflow answer  There are people writing PhD theses around the quantitative analysis of aspects of the problem  see this presentation by Vasily Volkov from UC Berkley and this paper by Henry Wong from the University of Toronto for examples of how complex the question really is   At the entry level  you should mostly be aware that the block size you choose  within the range of legal block sizes defined by the constraints above  can and does have a impact on how fast your code will run  but it depends on the hardware you have and the code you are running  By benchmarking  you will probably find that most non-trivial code has a  quot sweet spot quot  in the 128-512 threads per block range  but it will require some analysis on your part to find where that is  The good news is that because you are working in multiples of the warp size  the search space is very finite and the best configuration for a given piece of code relatively easy to find

User · Answer

The blocksize is usually selected to maximize the  occupancy   Search on CUDA Occupancy for more information  In particular  see the CUDA Occupancy Calculator spreadsheet

[performance] How do I choose grid and block dimensions for CUDA kernels?

Examples related to performance

Examples related to optimization

Examples related to cuda

Examples related to gpu

Examples related to nvidia