[CUDA] Add comprehensive CUDA API call tracing for debugging
Add detailed logging infrastructure for CUDA adapter to help diagnose
multi-GPU issues. Controlled by UR_CUDA_CALL_TRACE=1 environment variable.
Features:
- Automatic logging of all CUDA API calls through UR_CHECK_ERROR macro
- Detailed parameter logging for key operations (kernel launch, memcpy)
- Context switch tracking with addresses
- Success/error result logging
- Zero overhead when disabled (compile-time check)
Logged operations:
- cuCtxGetCurrent/cuCtxSetCurrent with context addresses
- cuLaunchKernel with grid, block, shared memory, stream
- cuMemcpyAsync/cuMemcpyDtoDAsync with src/dst/size
- All other cuXxx functions with full call signature
Usage:
UR_CUDA_CALL_TRACE=1 ./test-binary
This enables deep debugging of multi-GPU synchronization, context
management, and memory operations without modifying test code.