onnxruntime
58a87dc1 - Add CUDA Graph support for the CUDA plugin EP (#28002)

2 days ago
## Description

This PR brings CUDA graph capture/replay to the CUDA plugin execution provider so plugin-based CUDA deployments can get the same reduced CPU launch overhead that the in-tree CUDA EP already supports. It also adds the ORT framework and plugin-C-API plumbing needed to let graph-enabled plugin EPs participate safely in warmup, capture, and replay, while preserving compatibility with older plugins through version-gated fallbacks.

## Summary of Changes

### CUDA plugin EP runtime and allocator integration

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Implements plugin-side graph capture lifecycle callbacks, per-thread graph context management, graph replay, and stream selection for graph-enabled runs. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Adds CUDA graph configuration/state to the plugin EP, including per-thread graph context ownership. |
| `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.cc` | Adds `CudaGraphSet`/`CudaGraphManager` to own captured graphs and coordinate warmup, capture, and replay by annotation ID. |
| `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.h` | Declares the new graph manager types and graph-related constants. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc` | Adds external-stream wrapping so graph-enabled runs can reuse the thread’s graph stream without taking ownership of it. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h` | Declares the external-stream initialization path and stream ownership tracking. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc` | Parses `enable_cuda_graph` and `min_num_runs_before_cuda_graph_capture` provider/session options for the plugin EP. |
| `onnxruntime/core/providers/cuda/plugin/cuda_mempool_allocator_plugin.cc` | Updates allocator behavior needed for CUDA native mempool compatibility during graph capture/replay. |
| `onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h` | Adjusts plugin kernel/device helpers used by the graph-enabled execution path. |
| `onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h` | Adds supporting helpers used by the plugin CUDA graph flow. |

### ORT framework and plugin API support for graph replay

| File | Change |
|------|--------|
| `include/onnxruntime/core/session/onnxruntime_ep_c_api.h` | Documents and extends the plugin EP contract for graph-enabled runs, including replay behavior relative to `OnRunStart`/`OnRunEnd`. |
| `include/onnxruntime/core/framework/execution_provider.h` | Adds graph-capture node-assignment policy support to the execution provider interface. |
| `onnxruntime/core/session/inference_session.cc` | Generalizes the session replay path and warmup/capture retry loop so ORT can drive graph replay for graph-capable EPs. |
| `onnxruntime/core/session/inference_session.h` | Updates replay-related messaging and supporting declarations for the new run flow. |
| `onnxruntime/core/framework/session_state.cc` | Makes device-stream collection reuse thread-affine so warmup/capture/replay reuse stays on the owning thread. |
| `onnxruntime/core/framework/session_state.h` | Adds supporting state for the thread-affine stream collection pool. |
| `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` | Bridges the new graph callbacks, hardens validation of plugin graph support, and exposes effective plugin provider options gathered from session config. |
| `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h` | Stores provider options and declares the new accessor/graph bridge behavior. |
| `onnxruntime/core/providers/webgpu/webgpu_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. |
| `onnxruntime/core/providers/js/js_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. |

### Tests and validation coverage

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_cuda_plugin_ep.py` | Adds end-to-end CUDA graph tests for warmup/capture/replay, replay after input updates, CUDA mempool mode, multiple graph annotation IDs, multi-GPU/device-id coverage, and a simple Add model. |

### Documentation

| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_graph_for_cuda_plugin.md` | Adds a dedicated design/implementation document covering architecture, lifecycle, allocator interaction, concurrency, and verification guidance. |
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | Updates the broader plugin EP design doc to reflect that CUDA graph support is implemented and documents the framework-level changes. |
| `docs/cuda_plugin_ep/QUICK_START.md` | Updates quick-start/testing guidance and removes the outdated “no CUDA Graph support” limitation. |

## Testing

- Build ONNX Runtime with `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`, install the generated wheel, and deploy the CUDA plugin shared library as described in `docs/cuda_plugin_ep/QUICK_START.md`.
- Run `python onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`.
- Pay particular attention to the new CUDA graph scenarios in that suite: warmup/capture/replay, replay after in-place input updates, CUDA mempool mode, multiple `gpu_graph_id` captures, and the second-device path when multiple GPUs are available.
- Verify backward compatibility by confirming older plugins still load safely through the version-gated graph callback bridge, and that graph-disabled runs continue through the normal execution path.

## Motivation and Context

The CUDA plugin EP exists to decouple CUDA EP delivery from core ONNX Runtime releases, but that model only works well if important runtime optimizations are also available through the plugin path. CUDA graph replay is one of the highest-value CUDA execution optimizations because it eliminates repeated kernel-launch overhead after capture, especially for steady-state inference workloads.

Supporting that in the plugin EP required more than adding plugin-local capture code. ORT also needed a framework-level replay flow that works for plugin EPs, a plugin C API contract for graph support and node-assignment policy, and thread-affine stream reuse so captured graph resources and stream wrappers are not reused across unrelated threads. This PR packages those pieces together and documents the resulting behavior for future plugin EP work. It also depends on earlier plugin allocator work so warmup can stabilize allocations before capture begins.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
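
For readers new to the warmup, capture, and replay flow mentioned in the summary, the following is a minimal pure-Python sketch of how such a policy might be keyed by graph annotation ID. It is an illustration only, not the code in `cuda_graph_plugin.cc`: the class names `GraphState` and `GraphLifecycleModel` are hypothetical, and the sketch assumes one phase per call, whereas the PR describes a warmup/capture retry loop inside the session. The only name taken from the PR is the idea behind `min_num_runs_before_cuda_graph_capture`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GraphState:
    """Per-annotation-ID bookkeeping (hypothetical model, not an ORT type)."""
    warmup_runs: int = 0
    captured: bool = False


@dataclass
class GraphLifecycleModel:
    """Toy warmup -> capture -> replay policy keyed by annotation ID."""
    # Mirrors the intent of min_num_runs_before_cuda_graph_capture (assumed semantics).
    min_num_runs_before_capture: int = 2
    states: Dict[int, GraphState] = field(default_factory=dict)

    def run(self, annotation_id: int) -> str:
        """Return the phase a run with this annotation ID would take."""
        state = self.states.setdefault(annotation_id, GraphState())
        if state.captured:
            return "replay"
        if state.warmup_runs < self.min_num_runs_before_capture:
            # Warmup runs execute normally so allocations can stabilize.
            state.warmup_runs += 1
            return "warmup"
        # Enough warmup runs have happened: record the graph once.
        state.captured = True
        return "capture"


model = GraphLifecycleModel(min_num_runs_before_capture=2)
phases_id1 = [model.run(1) for _ in range(5)]
phases_id2 = [model.run(2) for _ in range(3)]
print(phases_id1)
print(phases_id2)
```

In this model each annotation ID warms up and captures independently, which is the property the "multiple `gpu_graph_id` captures" test scenario exercises.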
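
The thread-affine stream reuse described above can be illustrated with a small pure-Python model. This is a sketch under assumptions, not the `session_state.cc` implementation: `ThreadAffinePool` is a hypothetical name, and plain `object()` instances stand in for device-stream collections. The point it shows is that an item released by one thread is only handed back to that same thread.

```python
import threading
from collections import defaultdict
from typing import Dict, List


class ThreadAffinePool:
    """Toy pool: a released item is only reused by the thread that released it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._free: Dict[int, List[object]] = defaultdict(list)

    def acquire(self) -> object:
        tid = threading.get_ident()
        with self._lock:
            bucket = self._free[tid]
            if bucket:
                return bucket.pop()
        # Nothing cached for this thread: create a fresh stand-in item.
        return object()

    def release(self, item: object) -> None:
        tid = threading.get_ident()
        with self._lock:
            self._free[tid].append(item)


pool = ThreadAffinePool()
first = pool.acquire()
pool.release(first)
reused_on_same_thread = pool.acquire() is first
pool.release(first)

# A different thread must not receive the item cached by the main thread.
seen_on_other_thread: List[object] = []
worker = threading.Thread(target=lambda: seen_on_other_thread.append(pool.acquire()))
worker.start()
worker.join()
reused_across_threads = seen_on_other_thread[0] is first
print(reused_on_same_thread, reused_across_threads)
```

Keying the free list by thread ID trades some reuse for the guarantee that resources tied to one thread's captured graph are never picked up by an unrelated thread.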