onnxruntime
3b022ecb - [CUDA] Support user compute stream with CUDA graph in CUDA plugin EP (#29221)

Commit
9 days ago
[CUDA] Support user compute stream with CUDA graph in CUDA plugin EP (#29221)

## Description

The CUDA plugin EP previously rejected combining a user-provided compute stream (`user_compute_stream`) with CUDA graph capture (`enable_cuda_graph`), returning `ORT_INVALID_ARGUMENT`. This PR removes that restriction so the two options can be used together: when both are set, graph capture and replay run on the user-owned stream (the same stream the kernels are issued to), matching the bundled (non-plugin) CUDA EP behavior. Several supporting fixes make capture on a shared stream stable and Memcpy-free.

## Summary of Changes

### Allow user stream + CUDA graph

| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc) | Remove the validation that rejected `user_compute_stream` + `enable_cuda_graph` together. |
| [onnxruntime/core/providers/cuda/plugin/cuda_ep.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep.cc) | `PerThreadContext` accepts an optional external graph stream. When both options are set it captures/replays on the user stream and does **not** create or destroy it (the user owns its lifetime); otherwise it owns a dedicated graph stream as before. |

### Stable, Memcpy-free CUDA graph capture

| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h](onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h) | Route kernel scratch/workspace allocations through the EP allocator (BFC arena) instead of raw `cudaMallocAsync`/`cudaMalloc`. After warmup the arena reaches steady state, so the capture run serves scratch from already-reserved chunks and the device free-memory footprint stays stable, which is required for correct capture. Matches the built-in CUDA EP. |
| [onnxruntime/core/providers/cuda/tensor/shape_op.cc](onnxruntime/core/providers/cuda/tensor/shape_op.cc) | Add an adapter-based `Shape` kernel under `#ifdef BUILD_CUDA_EP_AS_PLUGIN` with identical semantics to the CPU `Shape`. Registering `Shape` on the EP keeps it off the CPU EP and avoids the Memcpy nodes that would otherwise break CUDA graph capture. |
| [cmake/onnxruntime_providers_cuda_plugin.cmake](cmake/onnxruntime_providers_cuda_plugin.cmake) | Stop excluding `shape_op.cc` from the plugin build so the adapter-based `Shape` kernel is compiled in. |

### Null-allocator fallback in PrePack (plugin boundary)

In the plugin build the `AllocatorPtr` passed to `PrePack` can arrive null across the library boundary. Each kernel now falls back to its own default-memory allocator (`Info().GetAllocator(OrtMemTypeDefault)`), which is always valid.

- [onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc](onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc)
- [onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc](onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc)
- [onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc](onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc)

### Misc

- [onnxruntime/core/framework/session_state.cc](onnxruntime/core/framework/session_state.cc): wrap a long line (no behavior change).

## Testing

- New test: [onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc](onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc) covering:
  1. Session creation succeeds with both `user_compute_stream` and `enable_cuda_graph` set (regression for the removed validation).
  2. Capture + replay on the user stream produce correct results.
  3. Replay after an in-place input update on the user stream is correct.
- Tests are gated on `ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP` and skip gracefully when no CUDA device or plugin library is available.

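
For illustration, the null-allocator fallback described for `PrePack` can be sketched in isolation. This is a minimal sketch and not the code from this PR: `Allocator`, `MallocAllocator`, and `ResolvePrePackAllocator` are hypothetical stand-ins chosen so the example builds without ONNX Runtime or CUDA. In the actual kernels the fallback allocator is obtained via `Info().GetAllocator(OrtMemTypeDefault)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>

// Hypothetical stand-in for an allocator interface (not ORT's real IAllocator).
struct Allocator {
  virtual ~Allocator() = default;
  virtual void* Alloc(std::size_t size) = 0;
  virtual void Free(void* p) = 0;
};
using AllocatorPtr = std::shared_ptr<Allocator>;

// Hypothetical host-memory allocator, used here only so the sketch runs without a GPU.
struct MallocAllocator final : Allocator {
  void* Alloc(std::size_t size) override { return std::malloc(size); }
  void Free(void* p) override { std::free(p); }
};

// Hypothetical helper: use the allocator handed to PrePack when it is non-null,
// otherwise fall back to the kernel's own default allocator.
inline AllocatorPtr ResolvePrePackAllocator(AllocatorPtr passed,
                                            const AllocatorPtr& kernel_default) {
  // The fallback is assumed to always be valid, as the description states.
  assert(kernel_default != nullptr);
  return passed ? std::move(passed) : kernel_default;
}
```

Resolving the allocator once at the top of `PrePack` would keep the rest of the function unchanged: every later allocation goes through the resolved pointer regardless of where it came from.
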
## Motivation and Context

Users that drive ORT from their own CUDA stream (e.g. to interleave ORT inference with their own kernels) previously could not also benefit from CUDA graph capture on the plugin EP. This change brings the plugin EP to parity with the bundled CUDA EP for that workflow.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (relaxes a previously rejected option combination)
- [ ] Documentation updated (if applicable)
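
As an illustration of the stream-ownership rule described for `PerThreadContext` (capture on the user stream without creating or destroying it, otherwise own a dedicated graph stream), here is a minimal sketch. It is not the PR's implementation: `StreamHandle`, `CreateStream`, `DestroyStream`, `g_live_streams`, and `GraphStreamHolder` are hypothetical stand-ins for `cudaStream_t` and the CUDA stream create/destroy calls, used so the example builds without CUDA.

```cpp
#include <cassert>

// Hypothetical stand-ins for cudaStream_t and the CUDA stream create/destroy calls.
using StreamHandle = int*;
static int g_live_streams = 0;  // counts streams that currently exist

inline StreamHandle CreateStream() {
  ++g_live_streams;
  return new int(0);
}

inline void DestroyStream(StreamHandle s) {
  --g_live_streams;
  delete s;
}

// Hypothetical holder: borrows an external (user) stream when one is given,
// otherwise creates and owns a dedicated graph stream.
class GraphStreamHolder {
 public:
  explicit GraphStreamHolder(StreamHandle external = nullptr)
      : stream_(external ? external : CreateStream()),
        owns_(external == nullptr) {}

  ~GraphStreamHolder() {
    // Only destroy a stream this holder created; a user stream is left untouched.
    if (owns_) DestroyStream(stream_);
  }

  GraphStreamHolder(const GraphStreamHolder&) = delete;
  GraphStreamHolder& operator=(const GraphStreamHolder&) = delete;

  StreamHandle get() const { return stream_; }
  bool owns() const { return owns_; }

 private:
  StreamHandle stream_;
  bool owns_;
};
```

Recording ownership at construction means the destructor never has to guess where the stream came from, and the user's stream stays valid after the holder is gone.
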