Implement CUDA EP Plugin profiling API (#28216)
This pull request adds support for CUPTI-based GPU profiling to the CUDA
plugin execution provider (EP) in ONNX Runtime. Profiling is now
available in the plugin EP when built with the
`onnxruntime_ENABLE_CUDA_PROFILING` CMake flag, enabling detailed GPU
activity tracing and integration with ORT's profiling system. The
implementation introduces a new `CudaPluginEpProfiler` that bridges
ORT's profiling API and CUPTI, and updates the build system,
plugin interface, and documentation accordingly.
**CUDA Plugin Profiling Integration:**
* Added a new `CudaPluginEpProfiler` class
(`cuda_profiler_plugin.h/.cc`) that implements the `OrtEpProfilerImpl`
interface, delegates to a `CUPTIManager` singleton for GPU activity
tracing, and provides callbacks for profiling lifecycle and event
correlation.
* Updated the plugin EP interface in `cuda_ep.h`/`cuda_ep.cc` to
conditionally provide a `CreateProfilerImpl` callback when profiling is
enabled, wiring up the new profiler implementation.
* Modified the CMake build (`onnxruntime_providers_cuda_plugin.cmake`)
to conditionally link against `CUDA::cupti` and define the necessary
compile-time flags for profiling support.
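The conditional wiring described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SKETCH_ENABLE_CUDA_PROFILING`, `ProfilerImpl`, `SketchCudaProfiler`, `SketchEp`, and `ProfilingAvailable` are all hypothetical names, and the real `OrtEpProfilerImpl` interface and `CreateProfilerImpl` callback have different signatures. The idea it shows is only that the factory callback stays null unless the build defines the profiling macro, so a host can detect at runtime whether profiling was compiled in.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a profiler interface (NOT the real OrtEpProfilerImpl).
struct ProfilerImpl {
  virtual ~ProfilerImpl() = default;
  virtual std::string Name() const = 0;
};

#ifdef SKETCH_ENABLE_CUDA_PROFILING
// Compiled only when the (hypothetical) profiling macro is defined by the build.
struct SketchCudaProfiler : ProfilerImpl {
  std::string Name() const override { return "cupti"; }
};

std::unique_ptr<ProfilerImpl> CreateSketchProfiler() {
  return std::make_unique<SketchCudaProfiler>();
}
#endif

// Hypothetical EP object exposing an optional profiler factory callback.
struct SketchEp {
  using CreateProfilerFn = std::unique_ptr<ProfilerImpl> (*)();

  // Stays nullptr when profiling support is not compiled in.
  CreateProfilerFn create_profiler = nullptr;

  SketchEp() {
#ifdef SKETCH_ENABLE_CUDA_PROFILING
    create_profiler = &CreateSketchProfiler;
#endif
  }
};

// The host checks the callback before asking the EP for a profiler.
inline bool ProfilingAvailable(const SketchEp& ep) {
  return ep.create_profiler != nullptr;
}
```

Compiled without the macro, `ProfilingAvailable` returns `false` and no CUPTI-dependent code is built. Compiling with `-DSKETCH_ENABLE_CUDA_PROFILING` fills in the callback, which mirrors how a CMake option could add both the compile definition and the `CUDA::cupti` link dependency together.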
**Documentation Updates:**
* Expanded the design documentation (`cuda_plugin_ep_design.md`) to
describe the profiling and observability architecture, CUPTI
integration, correlation ID flow, event collection, and differences from
the in-tree CUDA EP profiler. Build configuration and relevant source
files are also documented.
**Miscellaneous:**
* Included the new profiler header in the plugin EP implementation.
* Minor test and import adjustments (e.g., `test_cuda_plugin_ep.py`).
These changes enable the CUDA plugin EP to participate fully in ORT's
profiling system, allowing users to observe GPU kernel and memory
activity in conjunction with CPU-side events when profiling is enabled.