Add CUDA Graph support for the CUDA plugin EP (#28002)
## Description
This PR brings CUDA graph capture/replay to the CUDA plugin execution
provider so plugin-based CUDA deployments can get the same reduced CPU
launch overhead that the in-tree CUDA EP already supports. It also adds
the ORT framework and plugin-C-API plumbing needed to let graph-enabled
plugin EPs participate safely in warmup, capture, and replay, while
preserving compatibility with older plugins through version-gated
fallbacks.
## Summary of Changes
### CUDA plugin EP runtime and allocator integration
| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Implements plugin-side graph capture lifecycle callbacks, per-thread graph context management, graph replay, and stream selection for graph-enabled runs. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Adds CUDA graph configuration/state to the plugin EP, including per-thread graph context ownership. |
| `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.cc` | Adds `CudaGraphSet`/`CudaGraphManager` to own captured graphs and coordinate warmup, capture, and replay by annotation ID. |
| `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.h` | Declares the new graph manager types and graph-related constants. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc` | Adds external-stream wrapping so graph-enabled runs can reuse the thread’s graph stream without taking ownership of it. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h` | Declares the external-stream initialization path and stream ownership tracking. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc` | Parses `enable_cuda_graph` and `min_num_runs_before_cuda_graph_capture` provider/session options for the plugin EP. |
| `onnxruntime/core/providers/cuda/plugin/cuda_mempool_allocator_plugin.cc` | Updates allocator behavior needed for CUDA native mempool compatibility during graph capture/replay. |
| `onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h` | Adjusts plugin kernel/device helpers used by the graph-enabled execution path. |
| `onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h` | Adds supporting helpers used by the plugin CUDA graph flow. |
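
The warmup, capture, and replay sequencing described in the table above can be illustrated with a small model. This is a minimal sketch in plain Python, not the C++ code in `cuda_graph_plugin.cc`: the names `GraphConfig`, `GraphManagerSketch`, `Phase`, and `parse_graph_options` are hypothetical, and the accepted truthy strings and default values are assumptions. Only the two option keys and the idea of keying graphs by annotation ID come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    WARMUP = "warmup"    # run normally so allocations can settle
    CAPTURE = "capture"  # record the run into a graph
    REPLAY = "replay"    # launch the previously captured graph


@dataclass
class GraphConfig:
    enabled: bool = False     # assumed default, for illustration only
    min_warmup_runs: int = 0  # assumed default, for illustration only


def parse_graph_options(options: dict[str, str]) -> GraphConfig:
    """Read the two option keys named in this PR from a string-valued dict."""
    enabled = options.get("enable_cuda_graph", "0").lower() in ("1", "true")
    min_runs = int(options.get("min_num_runs_before_cuda_graph_capture", "0"))
    return GraphConfig(enabled=enabled, min_warmup_runs=max(min_runs, 0))


@dataclass
class GraphManagerSketch:
    """Hypothetical stand-in that tracks state per annotation ID."""

    config: GraphConfig
    runs_seen: dict[int, int] = field(default_factory=dict)
    captured: set[int] = field(default_factory=set)

    def next_phase(self, annotation_id: int) -> Phase | None:
        """Decide what the next run for this annotation ID should do."""
        if not self.config.enabled:
            return None  # graph-disabled runs take the normal execution path
        if annotation_id in self.captured:
            return Phase.REPLAY
        seen = self.runs_seen.get(annotation_id, 0)
        if seen < self.config.min_warmup_runs:
            self.runs_seen[annotation_id] = seen + 1
            return Phase.WARMUP
        self.captured.add(annotation_id)
        return Phase.CAPTURE
```

Each annotation ID carries its own warmup counter and captured flag in this model, so a second ID starts from warmup even after the first has reached replay.
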
### ORT framework and plugin API support for graph replay
| File | Change |
|------|--------|
| `include/onnxruntime/core/session/onnxruntime_ep_c_api.h` | Documents and extends the plugin EP contract for graph-enabled runs, including replay behavior relative to `OnRunStart`/`OnRunEnd`. |
| `include/onnxruntime/core/framework/execution_provider.h` | Adds graph-capture node-assignment policy support to the execution provider interface. |
| `onnxruntime/core/session/inference_session.cc` | Generalizes the session replay path and warmup/capture retry loop so ORT can drive graph replay for graph-capable EPs. |
| `onnxruntime/core/session/inference_session.h` | Updates replay-related messaging and supporting declarations for the new run flow. |
| `onnxruntime/core/framework/session_state.cc` | Makes device-stream collection reuse thread-affine so warmup/capture/replay reuse stays on the owning thread. |
| `onnxruntime/core/framework/session_state.h` | Adds supporting state for the thread-affine stream collection pool. |
| `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` | Bridges the new graph callbacks, hardens validation of plugin graph support, and exposes effective plugin provider options gathered from session config. |
| `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h` | Stores provider options and declares the new accessor/graph bridge behavior. |
| `onnxruntime/core/providers/webgpu/webgpu_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. |
| `onnxruntime/core/providers/js/js_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. |
### Tests and validation coverage
| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_cuda_plugin_ep.py` | Adds end-to-end CUDA graph tests for warmup/capture/replay, replay after input updates, CUDA mempool mode, multiple graph annotation IDs, multi-GPU/device-id coverage, and a simple Add model. |
### Documentation
| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_graph_for_cuda_plugin.md` | Adds a dedicated design/implementation document covering architecture, lifecycle, allocator interaction, concurrency, and verification guidance. |
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | Updates the broader plugin EP design doc to reflect that CUDA graph support is implemented and documents the framework-level changes. |
| `docs/cuda_plugin_ep/QUICK_START.md` | Updates quick-start/testing guidance and removes the outdated “no CUDA Graph support” limitation. |
## Testing
- Build ONNX Runtime with `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`,
install the generated wheel, and deploy the CUDA plugin shared library
as described in `docs/cuda_plugin_ep/QUICK_START.md`.
- Run `python onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`.
- Pay particular attention to the new CUDA graph scenarios in that
suite: warmup/capture/replay, replay after in-place input updates, CUDA
mempool mode, multiple `gpu_graph_id` captures, and the second-device
path when multiple GPUs are available.
- Verify backward compatibility by confirming older plugins still load
safely through the version-gated graph callback bridge, and that
graph-disabled runs continue through the normal execution path.
## Motivation and Context
The CUDA plugin EP exists to decouple CUDA EP delivery from core ONNX
Runtime releases, but that model only works well if important runtime
optimizations are also available through the plugin path. CUDA graph
replay is one of the highest-value CUDA execution optimizations because,
after capture, it replaces repeated per-kernel launches with a single
graph launch, which cuts CPU launch overhead for steady-state inference
workloads.
Supporting that in the plugin EP required more than adding plugin-local
capture code. ORT also needed a framework-level replay flow that works
for plugin EPs, a plugin C API contract for graph support and
node-assignment policy, and thread-affine stream reuse so captured graph
resources and stream wrappers are not reused across unrelated threads.
This PR packages those pieces together and documents the resulting
behavior for future plugin EP work. It also depends on earlier plugin
allocator work so warmup can stabilize allocations before capture
begins.
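
The thread-affine reuse mentioned above can be pictured as a pool whose idle entries are bucketed by the thread that returned them. The following is a rough sketch in plain Python, not the logic in `session_state.cc`: `ThreadAffinePool` and its `factory` argument are hypothetical names, and the sketch assumes an item is released on the same thread that acquired it.

```python
from __future__ import annotations

import threading
from collections import defaultdict


class ThreadAffinePool:
    """Hypothetical pool that only hands an idle item back to its owning thread."""

    def __init__(self, factory):
        self._factory = factory          # creates a fresh item when none is idle
        self._lock = threading.Lock()
        self._free = defaultdict(list)   # thread id -> idle items for that thread

    def acquire(self):
        tid = threading.get_ident()
        with self._lock:
            bucket = self._free[tid]
            if bucket:
                return bucket.pop()      # reuse an item this thread released earlier
        return self._factory()           # never borrow from another thread's bucket

    def release(self, item):
        # Assumption: called on the thread that acquired the item.
        tid = threading.get_ident()
        with self._lock:
            self._free[tid].append(item)
```

In this model a thread that warmed up and captured with a given item gets that same item back for replay, while a different thread receives a fresh one instead of sharing it.
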
## Checklist
- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)