[CUDA] Support user compute stream with CUDA graph in CUDA plugin EP (#29221)
## Description
The CUDA plugin EP previously rejected combining a user-provided compute stream
(`user_compute_stream`) with CUDA graph capture (`enable_cuda_graph`), returning
`ORT_INVALID_ARGUMENT`. This PR removes that restriction so the two options can
be used together: when both are set, graph capture and replay run on the
user-owned stream (the same stream the kernels are issued to), matching the
bundled (non-plugin) CUDA EP behavior. Several supporting fixes make capture on
a shared stream stable and Memcpy-free.
## Summary of Changes
### Allow user stream + CUDA graph
| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc) | Remove the validation that rejected `user_compute_stream` + `enable_cuda_graph` together. |
| [onnxruntime/core/providers/cuda/plugin/cuda_ep.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep.cc) | `PerThreadContext` accepts an optional external graph stream. When both options are set it captures/replays on the user stream and does **not** create or destroy it (the user owns its lifetime); otherwise it owns a dedicated graph stream as before. |
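
The ownership rule for the graph stream can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `FakeStream`, `CreateFakeStream`, `DestroyFakeStream`, and `GraphStreamHolder` are hypothetical stand-ins (for a CUDA stream handle, stream creation/destruction, and the stream-handling part of `PerThreadContext`), so the code runs without CUDA and does not reproduce the actual class.

```cpp
#include <cassert>

// Hypothetical stand-in for a CUDA stream handle so the sketch runs without CUDA.
struct FakeStream {
  bool alive = true;
};

// Counters let a caller observe how many streams the holder created/destroyed.
static int g_created = 0;
static int g_destroyed = 0;

static FakeStream* CreateFakeStream() {
  ++g_created;
  return new FakeStream{};
}

static void DestroyFakeStream(FakeStream* s) {
  ++g_destroyed;
  delete s;
}

// Models the rule described above: borrow an external (user) stream when one
// is supplied, otherwise create and own a dedicated graph stream.
class GraphStreamHolder {
 public:
  explicit GraphStreamHolder(FakeStream* external_stream = nullptr)
      : stream_(external_stream ? external_stream : CreateFakeStream()),
        owns_stream_(external_stream == nullptr) {
    assert(stream_ != nullptr);
  }

  ~GraphStreamHolder() {
    // Only destroy what this object created; a user stream outlives it.
    if (owns_stream_) DestroyFakeStream(stream_);
  }

  GraphStreamHolder(const GraphStreamHolder&) = delete;
  GraphStreamHolder& operator=(const GraphStreamHolder&) = delete;

  FakeStream* stream() const { return stream_; }
  bool owns_stream() const { return owns_stream_; }

 private:
  FakeStream* stream_;
  bool owns_stream_;
};
```

Tracking ownership with a single flag keeps the destructor trivial to audit: the holder never touches the lifetime of a stream it did not create.
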
### Stable, Memcpy-free CUDA graph capture
| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h](onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h) | Route kernel scratch/workspace allocations through the EP allocator (BFC arena) instead of raw `cudaMallocAsync`/`cudaMalloc`. After warmup the arena reaches steady state, so the capture run serves scratch from already-reserved chunks and the device free-memory footprint stays stable, which is required for correct capture. Matches the built-in CUDA EP. |
| [onnxruntime/core/providers/cuda/tensor/shape_op.cc](onnxruntime/core/providers/cuda/tensor/shape_op.cc) | Add an adapter-based `Shape` kernel under `#ifdef BUILD_CUDA_EP_AS_PLUGIN` with identical semantics to the CPU `Shape`. Registering `Shape` on the EP keeps it off the CPU EP and avoids the Memcpy nodes that would otherwise break CUDA graph capture. |
| [cmake/onnxruntime_providers_cuda_plugin.cmake](cmake/onnxruntime_providers_cuda_plugin.cmake) | Stop excluding `shape_op.cc` from the plugin build so the adapter-based `Shape` kernel is compiled in. |
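
To make the steady-state argument concrete, here is a toy model of why a warmed-up caching allocator stops asking the device for memory. It is a sketch only: `ToyArena` and `RunOnce` are hypothetical names, host memory stands in for device memory, and the exact-size reuse policy is far simpler than a real BFC arena.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

// Toy caching allocator; a hypothetical, heavily simplified stand-in for the
// EP's BFC arena. Freed blocks are kept and reused by exact size.
class ToyArena {
 public:
  void* Alloc(std::size_t bytes) {
    auto it = free_blocks_.find(bytes);
    if (it != free_blocks_.end() && !it->second.empty()) {
      void* p = it->second.back();  // Reuse a block reserved earlier.
      it->second.pop_back();
      return p;
    }
    ++device_alloc_calls_;  // Models a fresh device allocation.
    storage_.push_back(std::make_unique<unsigned char[]>(bytes));
    void* p = storage_.back().get();
    sizes_[p] = bytes;
    return p;
  }

  void Free(void* p) {
    auto it = sizes_.find(p);
    assert(it != sizes_.end());
    free_blocks_[it->second].push_back(p);  // Return to the arena, not the device.
  }

  int device_alloc_calls() const { return device_alloc_calls_; }

 private:
  std::map<std::size_t, std::vector<void*>> free_blocks_;
  std::map<void*, std::size_t> sizes_;
  std::vector<std::unique_ptr<unsigned char[]>> storage_;
  int device_alloc_calls_ = 0;
};

// One simulated inference run: every kernel grabs scratch, then releases it.
static void RunOnce(ToyArena& arena, const std::vector<std::size_t>& scratch_sizes) {
  std::vector<void*> live;
  for (std::size_t n : scratch_sizes) live.push_back(arena.Alloc(n));
  for (void* p : live) arena.Free(p);
}
```

In this model, a warmup call to `RunOnce` performs all the fresh allocations; a second call with the same scratch sizes is served entirely from cached blocks, so `device_alloc_calls()` does not change. That second call plays the role of the capture run.
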
### Null-allocator fallback in PrePack (plugin boundary)
In the plugin build the `AllocatorPtr` passed to `PrePack` can arrive null
across the library boundary. Each kernel now falls back to its own
default-memory allocator (`Info().GetAllocator(OrtMemTypeDefault)`), which is
always valid.
- [onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc](onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc)
- [onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc](onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc)
- [onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc](onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc)
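
The fallback pattern described in this section can be sketched in isolation. The types below are hypothetical stand-ins (`ToyAllocator`, `ToyAllocatorPtr`, `ToyKernel`, `ResolvePrePackAllocator`), not ORT's real `IAllocator`/`AllocatorPtr` or any kernel's actual `PrePack` signature; the kernel's own allocator models what `Info().GetAllocator(OrtMemTypeDefault)` returns.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an allocator; only the name is modelled.
struct ToyAllocator {
  std::string name;
};
using ToyAllocatorPtr = std::shared_ptr<ToyAllocator>;

// Models a kernel that always has a valid default-memory allocator of its own.
class ToyKernel {
 public:
  ToyKernel()
      : default_alloc_(std::make_shared<ToyAllocator>(ToyAllocator{"kernel-default"})) {}

  // Prefer the allocator passed in; fall back to the kernel's own allocator
  // when the argument arrived null.
  ToyAllocatorPtr ResolvePrePackAllocator(ToyAllocatorPtr passed) const {
    if (passed) return passed;
    assert(default_alloc_ != nullptr);
    return default_alloc_;
  }

 private:
  ToyAllocatorPtr default_alloc_;
};
```

Resolving the allocator once at the top of the pre-pack path means the rest of the code can use it without further null checks.
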
### Misc
- [onnxruntime/core/framework/session_state.cc](onnxruntime/core/framework/session_state.cc): wrap a long line (no behavior change).
## Testing
- New test: [onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc](onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc) covering:
  1. Session creation succeeds with both `user_compute_stream` and `enable_cuda_graph` set (regression for the removed validation).
  2. Capture + replay on the user stream produce correct results.
  3. Replay after an in-place input update on the user stream is correct.
- Tests are gated on `ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP` and skip gracefully when no CUDA device or plugin library is available.
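
The in-place update case can be pictured with a toy replay model. This is a hedged sketch, not the test itself: `ToyGraph` and `CaptureScaleAdd` are hypothetical, plain host code stands in for CUDA graph capture, and the arithmetic is arbitrary. The assumption it illustrates is that a captured graph stays bound to the buffer addresses seen at capture time, so new input values must be written into the same buffer before replay.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy "captured graph": a recorded list of operations bound to fixed buffer
// addresses. Hypothetical model only; it does not use the CUDA graph API.
class ToyGraph {
 public:
  void Record(std::function<void()> op) { ops_.push_back(std::move(op)); }

  void Replay() const {
    for (const auto& op : ops_) op();
  }

 private:
  std::vector<std::function<void()>> ops_;
};

// Captures y[i] = 2 * x[i] + 1 against the current addresses of x and y.
// The vectors must not be resized afterwards, or the recorded pointers dangle.
static ToyGraph CaptureScaleAdd(const std::vector<float>& x, std::vector<float>& y) {
  assert(x.size() == y.size());
  ToyGraph g;
  const float* in = x.data();
  float* out = y.data();
  const std::size_t n = x.size();
  g.Record([in, out, n] {
    for (std::size_t i = 0; i < n; ++i) out[i] = 2.0f * in[i] + 1.0f;
  });
  return g;
}
```

In this model, overwriting the elements of `x` and calling `Replay()` again produces outputs for the new values, because the recorded operation reads through the original pointer.
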
## Motivation and Context
Users who drive ORT from their own CUDA stream (e.g. to interleave ORT inference
with their own kernels) previously could not also benefit from CUDA graph
capture on the plugin EP. This change brings the plugin EP to parity with the
bundled CUDA EP for that workflow.
## Checklist
- [x] Tests added/updated
- [x] No breaking changes (relaxes a previously rejected option combination)
- [ ] Documentation updated (if applicable)