TL;DR: Error-Checking Preprocessor Macros for CUDA Fortran and CUDA C/C++. They are also on Github.

When calling functions from the CUDA Runtime API 1, usually an error code is returned. Although this gets ignored by many (most?) of the users, it can give great insight into the wrong-doings of your code.

Error handling is something omitted regularly for smaller code bases. For the CUDA errors, the reason might be in the additional lines of code, cluttering the more directly content-related API calls, or simple laziness.

But fear not! With preprocessor macros, there’s just little overhead to include error handling in your CUDA code.

Here are macros for CUDA C(++) and CUDA Fortran. See also the notes on error checking of kernels at the end.

Macros

C++, C

I do not know who initially came up with the idea. It’s on the NVIDIA devblogs, in a Gist, and also in a lot of the codes of my colleagues. I modified the usual snippet a bit, though, to create what I think is the most informative and concise representation of an error call.

#define CUDA_CALL( call )               \
{                                       \
cudaError_t result = call;              \
if ( cudaSuccess != result )            \
    std::cerr << "CUDA error " << result << " in " << __FILE__ << ":" << __LINE__ << ": " << cudaGetErrorString( result ) << " (" << #call << ")" << std::endl;  \
}

This assumes that iostream is loaded. For C, replace the std::cerr << std::endl statement with fprintf(stderr, "CUDA error %i in %s …", result, __FILE__, …).

Use it by wrapping a plain API call into it:

CUDA_CALL( cudaMalloc( (void**)&ad, csize ); )

It will print one line per error, giving the file name and the line number of the error, the raw error code and its explained string; and it will print the actual call (#call). An erroneous call will then look like

CUDA error 38 in hello-world.cu:50: no CUDA-capable device is detected (cudaMalloc( (void**)&bd, isize );)

CUDA Fortran

Since CUDA Fortran is only available through the PGI Fortran compiler, the following is true only for this compiler (especially with regards to the preprocessor and column width). Note: If you find more elegant solutions to the problems discussed in the following, let me know! I still have a lot to learn in the depths that is Fortran.

In general, there is no limitation in using a similar macro in CUDA Fortran code compared to the CUDA C version. But: Column width. Also when using modern Fortran 90 (.F90)2, PGI’s Fortran compiler only allows for lines with a width of 256 characters. And because the preprocessor is not able to propagate line breaks into the target source, the error-check-augmented resulting line will be quite long. If you run into a line-too-long error, consider using shorter variables, which is ugly and horrible and arrrr, Fortran, but that’s just the way it is.3 Another workaround would be the one explained for kernels later on. The line length is also the reason I opted for removing whitespace and non-descriptive variables. Sorry.

Macro:

#define CUDA_SUCCESS 0
#define CUDA_CALL__(e,fmt,c) \
e=c; \
if(e/=CUDA_SUCCESS) \
write(*,fmt) "CUDA Error ",e," in ",__FILE__,":",__LINE__,": ",trim(cudaGetErrorString(e))," (",#c,")"
#define CUDA_CALL(c) CUDA_CALL__(gpuStatus,fmt,c)

The macro lives best closely together with a module which provides gpuStatus and the format string fmt. It can then be used in any instrumented routine/program with a use debug:

module debug
    character(len=27) :: fmt = "(A,I0,A,A,A,I0,A,A,A,A,A,A)"
    integer :: gpuStatus
end module debug

CUDA_CALL is a shortcut to CUDA_CALL__. The latter is in case one wants to use a different status variable (for reuse or explicit checking) or a different format string.

Error Checking on Kernels

Kernels do not have a return value so the usual means of error checking will not work on them. Instead, use a macro-wrapped cudaGetLastError()4 directly after the kernel launch, plus a wrapped cudaDeviceSynchronize() if there aren’t any other subsequent API calls or to help structure error messages.

CUDA_CALL( cudaGetLastError(); )
CUDA_CALL( cudaDeviceSynchronize(); )

This method works in all cases where the error code can not be handled directly, e.g. if the Fortran line is too long.

  1. Although all of the written is true for the CUDA Driver API, I will refer to the Runtime API, since this is the more commonly used method of access to the GPU. 

  2. Make sure to give your filenames a capital F in F90 to include preprocessing on the file. 

  3. On StackOverflow, user talonmies adds an additional routine to reduce the length of the preprocessor macro. This should affect line-too-longs in Fortran beneficially, but comes with other caveats. It could be worth to do, though, if line-too-longs are a regular problem. 

  4. This StackOverflow answers uses cudaPeekAtLastError(). But as far as I see it the result should be equal.