Looking through the answers and comments on CUDA questions, and in the CUDA tag wiki, I see it is often suggested that the return status of every API call should checked for errors. The API documentation contains functions like cudaGetLastError
, cudaPeekAtLastError
, and cudaGetErrorString
, but what is the best way to put these together to reliably catch and report errors without requiring lots of extra code?
This question is related to
cuda
error-checking
The solution discussed here worked well for me. This solution uses built-in cuda functions and is very simple to implement.
The relevant code is copied below:
#include <stdio.h>
#include <stdlib.h>
__global__ void foo(int *ptr)
{
*ptr = 7;
}
int main(void)
{
foo<<<1,1>>>(0);
// make the host block until the device is finished with foo
cudaDeviceSynchronize();
// check for error
cudaError_t error = cudaGetLastError();
if(error != cudaSuccess)
{
// print the CUDA error message and exit
printf("CUDA error: %s\n", cudaGetErrorString(error));
exit(-1);
}
return 0;
}
I used to be irked by this problem; and I used to have a macro-cum-wrapper-function solution just like in Talonmies and Jared's answers, but, honestly? It makes using the CUDA Runtime API even more ugly and C-like.
So I've approached this in a different and more fundamental way. For a sample of the result, here's part of the CUDA vectorAdd
sample - with complete error checking of every runtime API call:
// (... prepare host-side buffers here ...)
auto current_device = cuda::device::current::get();
auto d_A = cuda::memory::device::make_unique<float[]>(current_device, numElements);
auto d_B = cuda::memory::device::make_unique<float[]>(current_device, numElements);
auto d_C = cuda::memory::device::make_unique<float[]>(current_device, numElements);
cuda::memory::copy(d_A.get(), h_A.get(), size);
cuda::memory::copy(d_B.get(), h_B.get(), size);
// (... prepare a launch configuration here... )
cuda::launch(vectorAdd, launch_config,
d_A.get(), d_B.get(), d_C.get(), numElements
);
cuda::memory::copy(h_C.get(), d_C.get(), size);
// (... verify results here...)
Again - all potential errors are checked , and an exception if an error occurred (caveat: If the kernel caused some error after launch, it will be caught after the attempt to copy the result, not before; to ensure the kernel was successful you would need to check for error between the launch and the copy with a cuda::outstanding_error::ensure_none()
command).
The code above uses my
Thin Modern-C++ wrappers for the CUDA Runtime API library (Github)
Note that the exceptions carry both a string explanation and the CUDA runtime API status code after the failing call.
A few links to how CUDA errors are automagically checked with these wrappers:
talonmies' answer above is a fine way to abort an application in an assert
-style manner.
Occasionally we may wish to report and recover from an error condition in a C++ context as part of a larger application.
Here's a reasonably terse way to do that by throwing a C++ exception derived from std::runtime_error
using thrust::system_error
:
#include <thrust/system_error.h>
#include <thrust/system/cuda/error.h>
#include <sstream>
void throw_on_cuda_error(cudaError_t code, const char *file, int line)
{
if(code != cudaSuccess)
{
std::stringstream ss;
ss << file << "(" << line << ")";
std::string file_and_line;
ss >> file_and_line;
throw thrust::system_error(code, thrust::cuda_category(), file_and_line);
}
}
This will incorporate the filename, line number, and an English language description of the cudaError_t
into the thrown exception's .what()
member:
#include <iostream>
int main()
{
try
{
// do something crazy
throw_on_cuda_error(cudaSetDevice(-1), __FILE__, __LINE__);
}
catch(thrust::system_error &e)
{
std::cerr << "CUDA error after cudaSetDevice: " << e.what() << std::endl;
// oops, recover
cudaSetDevice(0);
}
return 0;
}
The output:
$ nvcc exception.cu -run
CUDA error after cudaSetDevice: exception.cu(23): invalid device ordinal
A client of some_function
can distinguish CUDA errors from other kinds of errors if desired:
try
{
// call some_function which may throw something
some_function();
}
catch(thrust::system_error &e)
{
std::cerr << "CUDA error during some_function: " << e.what() << std::endl;
}
catch(std::bad_alloc &e)
{
std::cerr << "Bad memory allocation during some_function: " << e.what() << std::endl;
}
catch(std::runtime_error &e)
{
std::cerr << "Runtime error during some_function: " << e.what() << std::endl;
}
catch(...)
{
std::cerr << "Some other kind of error during some_function" << std::endl;
// no idea what to do, so just rethrow the exception
throw;
}
Because thrust::system_error
is a std::runtime_error
, we can alternatively handle it in the same manner of a broad class of errors if we don't require the precision of the previous example:
try
{
// call some_function which may throw something
some_function();
}
catch(std::runtime_error &e)
{
std::cerr << "Runtime error during some_function: " << e.what() << std::endl;
}
Source: Stackoverflow.com