Plugin EP event profiling APIs (#27649)
### Description
#### TLDR
This PR ports the existing C++
[EpProfiler](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/framework/execution_provider.h#L359)
interfaces used by provider-bridge EPs to the binary-stable C APIs for
plugin EPs. It introduces C/C++ APIs for creating/querying profiling
events, a container for appending EP events, and callback hooks
(`StartEvent`/`StopEvent`) that give EPs access to ORT event metadata in
real time.
#### Changes to the original C++ API
The original `EpProfiler` C++ interface was adapted for the C API with
the following intentional changes:
1. **`StartProfiling`** now receives an offset indicating the elapsed
time since profiling started, as opposed to receiving an
absolute/epoch-dependent profiling start time. This prevents EPs from
having to do epoch conversions. Credit to @edgchen1 for the idea.
2. **`StartEvent`/`StopEvent` receive an absolute, epoch-based
correlation ID (`ort_event_correlation_id`)** instead of a relative ORT
event ID. The `PluginEpProfiler` bridge layer automatically converts the
C++ `relative_ort_event_id` (microseconds since profiling start) to an
absolute `ort_event_correlation_id` by adding the epoch-based profiling
start time. This means plugin EPs can use the correlation ID directly
with profiling utilities like CUPTI or ROCTracer without computing the
conversion themselves.
3. **`StopEvent` now receives the completed ORT event as a parameter.**
This allows EPs to optionally inspect ORT event metadata (e.g.,
`op_name`, `event_name`) at the time the event ends, facilitating
annotation of correlated EP events.
4. **`EndProfiling` only allows EPs to *append* events (via
`OrtProfilingEventsContainer`), not read or modify the full events
array.** This is motivated by the following:
   - It prevents any one EP from modifying events generated by ORT or
     another EP.
   - Certain EPs (VitisAI and WebGPU) already only append events without
     reading the entire events array.
   - The CUDA EP reads the entire events array solely to merge/sort its
     own EP events next to correlated ORT events and add
     `parent_name`/`op_name` metadata. However:
     - Merging/sorting is mostly unnecessary since trace viewers that
       load these files do their own event sorting.
     - This merging/sorting step was previously required to augment CUDA
       EP events with metadata from the correlated ORT event. However,
       that can now be obtained more simply via the new `StopEvent`
       parameter that provides the EP with the full correlated ORT event.
     - The [merge algorithm used by CUDA
       EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
       **incorrectly** assumes ORT events are sorted by non-decreasing
       *start* time, but they are actually sorted by [non-decreasing *end*
       time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
       (also see
       https://github.com/microsoft/onnxruntime/pull/13706#discussion_r1042750808).
       Fixing this would require sorting the entire events array before
       asking a provider-bridge EP to merge its events into the global
       events array. It is unclear whether this is worth the runtime cost.
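
To make items 3 and 4 concrete, here is a rough sketch of how an EP profiler could cache ORT metadata in `StopEvent` and then annotate and append its own events in `EndProfiling`. It deliberately does not use the ORT headers: `Event`, `AppendOnlyEvents`, `SketchEpProfiler`, and `RecordEpEvent` are hypothetical stand-ins invented for illustration, not the actual `OrtProfilingEvent`, `OrtProfilingEventsContainer`, or `OrtEpProfilerImpl` signatures.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a profiling event; NOT the ORT type.
struct Event {
  std::string name;
  int64_t timestamp_us;
  int64_t duration_us;
  std::unordered_map<std::string, std::string> args;
};

// Hypothetical append-only view: it exposes no way to read or modify
// events that ORT or another EP already produced.
class AppendOnlyEvents {
 public:
  explicit AppendOnlyEvents(std::vector<Event>& all_events) : all_events_(all_events) {}

  void AddEvents(std::vector<Event> ep_events) {
    for (auto& e : ep_events) all_events_.push_back(std::move(e));
  }

 private:
  std::vector<Event>& all_events_;
};

// Hypothetical EP-side profiler mirroring the StopEvent/EndProfiling flow.
class SketchEpProfiler {
 public:
  // Assumption: the EP's tracing utility reports each EP event together with
  // the correlation ID that was active when the work was launched.
  void RecordEpEvent(uint64_t correlation_id, Event ep_event) {
    ep_events_.push_back({correlation_id, std::move(ep_event)});
  }

  // Cache the metadata of the completed ORT event, keyed by correlation ID.
  void StopEvent(uint64_t correlation_id, const Event& ort_event) {
    OrtMetadata& meta = ort_metadata_[correlation_id];
    meta.parent_name = ort_event.name;
    auto it = ort_event.args.find("op_name");
    if (it != ort_event.args.end()) meta.op_name = it->second;
  }

  // Annotate EP events with the cached metadata, then append them.
  void EndProfiling(AppendOnlyEvents& out) {
    std::vector<Event> annotated;
    for (auto& tagged : ep_events_) {
      auto it = ort_metadata_.find(tagged.correlation_id);
      if (it != ort_metadata_.end()) {
        tagged.event.args["parent_name"] = it->second.parent_name;
        if (!it->second.op_name.empty()) tagged.event.args["op_name"] = it->second.op_name;
      }
      annotated.push_back(std::move(tagged.event));
    }
    ep_events_.clear();
    ort_metadata_.clear();
    out.AddEvents(std::move(annotated));
  }

 private:
  struct OrtMetadata {
    std::string parent_name;
    std::string op_name;
  };
  struct TaggedEvent {
    uint64_t correlation_id;
    Event event;
  };
  std::unordered_map<uint64_t, OrtMetadata> ort_metadata_;
  std::vector<TaggedEvent> ep_events_;
};
```

Because the metadata arrives through `StopEvent`, this sketch never needs to read the global events array, which is the property the append-only container is meant to enforce.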
#### Naming conventions for ORT event IDs
- **C++ `EpProfiler` interface** (existing): Uses
`relative_ort_event_id` — a timestamp offset in microseconds relative to
profiling start.
- **C API `OrtEpProfilerImpl`** (new in this PR): Uses
`ort_event_correlation_id` — an absolute, epoch-based timestamp in
microseconds computed from `std::chrono::high_resolution_clock`
(platform-defined epoch). Unique across concurrent profiling sessions
within the same process.
- **Conversion**: The `PluginEpProfiler` bridge class (in
`ep_event_profiling.cc`) performs `ort_event_correlation_id =
relative_ort_event_id + profiling_start_time_epoch_us_`, mirroring the
pattern in `GPUTracerManager::PushCorrelation`.
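
The conversion described above amounts to a single addition in microseconds. A minimal sketch, assuming both values are unsigned 64-bit microsecond counts; the function names `ToCorrelationId` and `NowEpochUs` are hypothetical and are not taken from `ep_event_profiling.cc`:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: converts a relative ORT event ID (microseconds since
// profiling start) into an absolute correlation ID by adding the profiling
// start time, expressed in microseconds since the clock's epoch.
inline uint64_t ToCorrelationId(uint64_t relative_ort_event_id,
                                uint64_t profiling_start_time_epoch_us) {
  return relative_ort_event_id + profiling_start_time_epoch_us;
}

// Hypothetical helper: reads the current time in microseconds since the
// (platform-defined) epoch of std::chrono::high_resolution_clock.
inline uint64_t NowEpochUs() {
  using namespace std::chrono;
  return static_cast<uint64_t>(
      duration_cast<microseconds>(high_resolution_clock::now().time_since_epoch()).count());
}
```

In this sketch, a session would capture `NowEpochUs()` once when profiling starts and pass it as the second argument for every event in that session.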
### New C APIs
| API | Description |
|-----|-------------|
| `CreateProfilingEvent` | Create a profiling event with category, process/thread IDs, name, timestamp, duration, and key-value args |
| `ReleaseProfilingEvent` | Release a profiling event |
| `ProfilingEvent_GetCategory` | Get event category (`SESSION`, `NODE`, `KERNEL`, `API`) |
| `ProfilingEvent_GetName` | Get event name |
| `ProfilingEvent_GetTimestampUs` | Get event start timestamp (µs) |
| `ProfilingEvent_GetDurationUs` | Get event duration (µs) |
| `ProfilingEvent_GetArgValue` | Get an event argument value by key |
| `ProfilingEventsContainer_AddEvents` | Append an array of EP events to the output container |
| `OrtEp::CreateProfiler` | Returns an instance of the EP's profiler implementation |
| `OrtEpProfilerImpl::StartProfiling` | Called by ORT to start a profiling session. Receives elapsed time offset (ns) since ORT profiling started |
| `OrtEpProfilerImpl::StartEvent` | Called by ORT to notify that an ORT event has started. Receives an absolute `ort_event_correlation_id` |
| `OrtEpProfilerImpl::StopEvent` | Called by ORT to notify that an ORT event has ended. Receives the same `ort_event_correlation_id` and ORT event metadata |
| `OrtEpProfilerImpl::EndProfiling` | Called by ORT to end the profiling session and collect EP events into the output container |
| `OrtEpProfilerImpl::Release` | Release the profiler instance |
### New C++ wrapper classes
| Class | Description |
|-------|-------------|
| `Ort::ConstProfilingEvent` | Non-owning const wrapper for reading fields from an `OrtProfilingEvent` (e.g., in `StopEvent`) |
| `Ort::ProfilingEvent` | Owning wrapper that creates and manages an `OrtProfilingEvent` (e.g., for `EndProfiling`) |
| `Ort::UnownedProfilingEventsContainer` | Non-owning wrapper for adding events to an `OrtProfilingEventsContainer` during `EndProfiling` |
### Example EP profiling implementation
This PR updates an example plugin EP to use the new profiling APIs:
- Plugin EP code:
[test/autoep/library/example_plugin_ep_kernel_registry](https://github.com/microsoft/onnxruntime/tree/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry)
- `OrtEpProfilerImpl` implementation:
[ep_profiling.h](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h)
/
[ep_profiling.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc)
- `OrtEp::CreateProfiler()` implementation:
[ep.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep.cc)
### Existing bugs found
The following bugs are not fixed in this PR.
- The [merge algorithm used by CUDA
EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
**incorrectly** assumes ORT events are sorted by non-decreasing *start*
time, but they are actually sorted by [non-decreasing *end*
time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
(also see
https://github.com/microsoft/onnxruntime/pull/13706#discussion_r1042750808).
- Run profilers do not handle subgraphs (e.g., subgraph of a
control-flow operator). This has been the case since run profilers were
[introduced](https://github.com/microsoft/onnxruntime/pull/26846).
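
The first bug above can be illustrated with a small sketch. It assumes events are appended to the list when they complete, so a parent event that encloses child events is recorded after them. `TimedEvent` and the sample timings are hypothetical and are not taken from the ORT profiler.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical minimal event: start time and duration in microseconds.
struct TimedEvent {
  int64_t start_us;
  int64_t duration_us;
};

inline int64_t EndUs(const TimedEvent& e) { return e.start_us + e.duration_us; }

// True if events appear in non-decreasing order of start time.
inline bool SortedByStart(const std::vector<TimedEvent>& events) {
  return std::is_sorted(events.begin(), events.end(),
                        [](const TimedEvent& a, const TimedEvent& b) {
                          return a.start_us < b.start_us;
                        });
}

// True if events appear in non-decreasing order of end time.
inline bool SortedByEnd(const std::vector<TimedEvent>& events) {
  return std::is_sorted(events.begin(), events.end(),
                        [](const TimedEvent& a, const TimedEvent& b) {
                          return EndUs(a) < EndUs(b);
                        });
}

// Two child events nested inside one parent event, listed in the order in
// which they complete: child (10..30), child (35..45), parent (0..100).
inline std::vector<TimedEvent> NestedEventsInCompletionOrder() {
  return {{10, 20}, {35, 10}, {0, 100}};
}
```

For this sample, `SortedByEnd` holds while `SortedByStart` does not, so a merge that walks the list assuming start-time order can misplace events that belong to the enclosing parent.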
### Motivation and Context
Allows plugin EPs to generate profiling events, further closing the
functionality gap between provider-bridge EPs and plugin EPs.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>