Add enable_profiling in runoptions (#26846)
### Description
Support run-level profiling
This PR adds support for profiling individual Run executions, similar to
session-level profiling. Developers can enable run-level profiling by
setting `enable_profiling` and `profile_file_prefix` in RunOptions. Once
the run completes, a JSON profiling file is saved, named
`profile_file_prefix` + timestamp.
<img width="514" height="281" alt="png (2)"
src="https://github.com/user-attachments/assets/8a997068-71d9-49ed-8a5c-00e0fa8853af"
/>
### Key Changes
1. Introduced a local variable `run_profiler` in
`InferenceSession::Run`, which is destroyed after the run completes.
Using a dedicated profiler per run ensures that profiling data is
isolated and prevents interleaving or corruption across runs.
2. To maintain accurate execution time when both session-level and
run-level profiling are enabled, overloaded `Start` and
`EndTimeAndRecordEvent` functions have been added. These allow the
caller to provide timestamps instead of relying on
`std::chrono::high_resolution_clock::now()`, avoiding potential timing
inaccuracies.
3. Added a TLS variable `tls_run_profiler_` to support run-level
profiling with WebGPU Execution Provider (EP). This ensures that when
multiple threads enable run-level profiling, each thread logs only to
its own WebGPU profiler, keeping thread-specific data isolated.
4. Use `HH:MM:SS.mm` instead of `HH:MM:SS` in the JSON filename to
prevent filename conflicts when profiling multiple consecutive runs.
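
To illustrate item 2, here is a minimal, self-contained sketch of how caller-provided timestamps let two profilers record the same duration for one operator. `SketchProfiler` and `RunNodeWithBothProfilers` are hypothetical names invented for this example; only the overload names `Start` and `EndTimeAndRecordEvent` come from the PR, and the signatures shown here are assumptions, not the actual ONNX Runtime declarations.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using TimePoint = Clock::time_point;

// Hypothetical, simplified profiler used only for illustration.
class SketchProfiler {
 public:
  struct Event {
    std::string name;
    long long duration_us;
  };

  // Original style: the profiler reads the clock itself.
  TimePoint Start() const { return Clock::now(); }

  // Overload: reuse a timestamp the caller has already captured.
  TimePoint Start(TimePoint now) const { return now; }

  // Original style: the end time is read inside the call.
  void EndTimeAndRecordEvent(const std::string& name, TimePoint start) {
    EndTimeAndRecordEvent(name, start, Clock::now());
  }

  // Overload: the caller supplies the end time as well.
  void EndTimeAndRecordEvent(const std::string& name, TimePoint start,
                             TimePoint end) {
    assert(end >= start);
    const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    events_.push_back(Event{name, static_cast<long long>(us.count())});
  }

  const std::vector<Event>& events() const { return events_; }

 private:
  std::vector<Event> events_;
};

// Reads the clock once before and once after the kernel, then feeds the same
// pair of timestamps to both profilers so their durations cannot drift apart.
template <typename Fn>
void RunNodeWithBothProfilers(SketchProfiler& session_profiler,
                              SketchProfiler& run_profiler,
                              const std::string& name, Fn&& kernel) {
  const TimePoint begin = Clock::now();
  const TimePoint session_start = session_profiler.Start(begin);
  const TimePoint run_start = run_profiler.Start(begin);

  std::forward<Fn>(kernel)();

  const TimePoint end = Clock::now();
  session_profiler.EndTimeAndRecordEvent(name, session_start, end);
  run_profiler.EndTimeAndRecordEvent(name, run_start, end);
}
```

If each profiler called `now()` on its own, the second one would also measure the time spent inside the first profiler's bookkeeping. Sharing one pair of timestamps avoids that skew.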
### Motivation and Context
Previously, profiling was only available at the session level. Developers
sometimes want to profile a specific run, which this PR makes possible.
### Some details
When profiling is enabled via RunOptions, it should ideally collect two
types of events:
1. Profiler events
Used to calculate the CPU execution time of each operator.
2. Execution Provider (EP) profiler events
Used to measure GPU kernel execution time.
Unlike session-level profiling, run-level profiling must collect events
correctly when multiple threads run concurrently.
For 1, this is straightforward to support (see `sequential_executor.cc`).
We use a thread-local storage (TLS) variable, `RunLevelState` (defined in
`profiler.h`), to maintain run-level profiling state for each thread.
For 2, each Execution Provider (EP) has its own profiler implementation,
and each EP must ensure correct behavior under run-level profiling. This
PR ensures that the WebGPU profiler works correctly with run-level
profiling.
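
The thread-local approach described above can be sketched as follows. This is an illustrative example under stated assumptions, not the code in this PR: `SketchRunProfiler`, `ScopedRunProfiler`, `RecordNodeEvent`, and this `Run` function are hypothetical, and `tls_run_profiler` only mirrors the role of the PR's `tls_run_profiler_`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-run profiler: it only stores event names.
struct SketchRunProfiler {
  std::vector<std::string> events;
};

// Each thread sees its own copy of this pointer, so concurrent runs on
// different threads never write into each other's profiler.
thread_local SketchRunProfiler* tls_run_profiler = nullptr;

// RAII guard: installs the profiler for the current thread for the duration
// of one run and restores the previous value afterwards.
class ScopedRunProfiler {
 public:
  explicit ScopedRunProfiler(SketchRunProfiler* profiler)
      : previous_(tls_run_profiler) {
    tls_run_profiler = profiler;
  }
  ~ScopedRunProfiler() { tls_run_profiler = previous_; }

  ScopedRunProfiler(const ScopedRunProfiler&) = delete;
  ScopedRunProfiler& operator=(const ScopedRunProfiler&) = delete;

 private:
  SketchRunProfiler* previous_;
};

// Called from deep inside execution (for example by an EP); it records only
// if the current thread has run-level profiling enabled.
void RecordNodeEvent(const std::string& name) {
  if (tls_run_profiler != nullptr) {
    tls_run_profiler->events.push_back(name);
  }
}

// Stand-in for a single run: the profiler is a local variable, so it is
// destroyed when the run completes.
std::vector<std::string> Run(bool enable_profiling, const std::string& tag) {
  SketchRunProfiler run_profiler;
  ScopedRunProfiler scope(enable_profiling ? &run_profiler : nullptr);
  assert(enable_profiling == (tls_run_profiler != nullptr));

  RecordNodeEvent(tag + ":Conv");
  RecordNodeEvent(tag + ":Relu");
  return run_profiler.events;  // copied out before the guard resets the TLS
}
```

Because the pointer is `thread_local`, a thread that calls `Run(true, ...)` and another that calls `Run(false, ...)` at the same time each observe only their own setting, which matches the first scenario in the test table below.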
### Test Cases
| Scenario | Example | Expected Result |
|---------|---------|-----------------|
| Concurrent runs on the same session with different run-level profiling settings | t1: `sess1.Run({ enable_profiling: true })`<br>t2: `sess1.Run({ enable_profiling: false })`<br>t3: `sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one for `t1` and one for `t3`. |
| Run-level profiling enabled together with session-level profiling | `sess1 = OrtSession({ enable_profiling: true })`<br>`sess1.Run({ enable_profiling: true })` | Two trace JSON files are generated: one corresponding to session-level profiling and one corresponding to run-level profiling. |