Add exponential-backoff option for thread pool spin loop (#28096)
# Add exponential-backoff option for thread pool spin loop
## Description
This PR adds an opt-in exponential-backoff mode to the thread pool's
idle spin loop, complementing the configurable `spin_duration_us`
introduced in #27916. When enabled, each spin iteration emits a
geometrically increasing number of `SpinPause()` calls (1, 2, 4, …
capped at `spin_backoff_max`), which reduces pause-instruction density
and lowers CPU/power usage during the spin window—particularly on hybrid
(P/E core) and mobile platforms. The iteration count is automatically
scaled so the wall-clock spin budget targeted by `spin_duration_us` is
preserved.
The idea is adapted from #23278
(https://github.com/microsoft/onnxruntime/pull/21545 and
https://github.com/microsoft/onnxruntime/pull/22315) which showed
measurable power and latency improvements on Intel Meteor Lake by
reducing busy-wait density. This PR makes the technique opt-in and
composable with the time-bounded spin knob from #27916, so users can
independently control *how long* to spin and *how densely* to spin.
## Summary of Changes
### Core thread pool (`EigenNonBlockingThreadPool.h`)
| File | Change |
|------|--------|
| `include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h` | Add
`ThreadPoolWaiter` inner class implementing exponential backoff; add
`NormalizeBackoff()` and `ScaleSpinCountForBackoff()` helpers; replace
bare `SpinPause()` in `WorkerLoop` with `waiter.wait()`; store
`spin_backoff_max_` member and accept it in constructor |
### Configuration plumbing
| File | Change |
|------|--------|
|
`include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h`
| New config keys `session.intra_op.spin_backoff_max` and
`session.inter_op.spin_backoff_max` |
| `include/onnxruntime/core/platform/threadpool.h` | Add
`spin_backoff_max` parameter to `ThreadPool` constructor (default `1`,
backward-compatible) |
| `onnxruntime/core/common/threadpool.cc` | Forward `spin_backoff_max`
to `ThreadPoolTempl` |
| `onnxruntime/core/util/thread_utils.h` | Add `spin_backoff_max` field
to `OrtThreadPoolParams` |
| `onnxruntime/core/util/thread_utils.cc` | Pass `spin_backoff_max` into
`ThreadPool`; log it in `operator<<(OrtThreadPoolParams)` |
| `onnxruntime/core/session/inference_session.cc` | Add
`ParseSpinBackoffMax()` helper; parse & apply both intra-op and inter-op
config keys |
### Perf test CLI & benchmark script
| File | Change |
|------|--------|
| `onnxruntime/test/perftest/command_args_parser.cc` | New
`--spin_backoff_max` flag |
| `onnxruntime/test/perftest/ort_test_session.cc` | Apply flag to
session options |
| `onnxruntime/test/perftest/test_configuration.h` | New
`spin_backoff_max` field in `RunConfig` |
| `tools/perftest/benchmark_spin_settings.py` | New benchmark script
that runs `onnxruntime_perf_test` across a matrix of spin settings
(duration × backoff) and reports latency, throughput, CPU% |
## Key Design Decisions
1. **Default preserves existing behavior.** `spin_backoff_max = 1` means
one `SpinPause()` per iteration—identical to today. No performance
change for users who don't opt in.
2. **Wall-clock budget preservation.** When backoff is enabled, the
iteration count is divided by `spin_backoff_max` so the total number of
`SpinPause()` calls—and therefore the approximate spin duration—stays
the same as the non-backoff path.
3. **Composable with `spin_duration_us`.** Backoff and time-bounded
spinning are orthogonal knobs. Users can use either independently or
combine them (e.g., `spin_duration_us=1000` + `spin_backoff_max=8`).
4. **Subordinate to `allow_spinning`.** When spinning is disabled,
`spin_backoff_max` is ignored—same as `spin_duration_us`.
## Session option usage
```cpp
// Enable exponential backoff with cap 8, combined with 1ms time-bounded spinning
session_options.AddConfigEntry("session.intra_op.spin_duration_us", "1000");
session_options.AddConfigEntry("session.intra_op.spin_backoff_max", "8");
```
## Benchmark Results
Benchmarks run on an Intel i9-13900KF (6P cores / 12 threads under
WSL2), 32 GB RAM, Release build with CPU EP, using
`tools/perftest/benchmark_spin_settings.py`. Each configuration was
repeated 3–5 times (median latency, mean throughput/CPU reported).
Duration: 10 seconds per run.
### SqueezeNet (5 MB CNN) — 16 intra-op threads, 5 repeats
High thread count amplifies spin contention, making this the most
illustrative test:
| Config | `spin_duration_us` | `spin_backoff_max` | Avg Latency (ms) |
Throughput (IPS) | CPU % |
|--------|---|---|:-:|:-:|:-:|
| `default` | (legacy) | (legacy) | 3.243 | 303.0 | 1245.8 |
| `no_spin` | — | — | 5.489 | 176.3 | 332.7 |
| `spin_1000` | 1000 | — | 1.870 | 514.5 | 1214.6 |
| `spin_2000` | 2000 | — | 2.040 | 478.2 | 1219.9 |
| `backoff_8` | (legacy) | 8 | 3.268 | 303.4 | 1257.4 |
| `spin_1000_backoff_4` | 1000 | 4 | 1.849 | 513.8 | 1221.4 |
| **`spin_1000_backoff_8`** | **1000** | **8** | **1.835** | **534.5** |
**1221.1** |
| `spin_2000_backoff_8` | 2000 | 8 | 2.050 | 470.3 | 1223.2 |
**Best: `spin_1000_backoff_8`** — **43% lower latency**, **76% higher
throughput** vs default, while using **2% less CPU**.
### SqueezeNet — 8 intra-op threads
| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 1.578 | 628.7 | 826.5 |
| `no_spin` | 3.742 | 261.1 | 322.9 |
| `spin_1000` | 1.547 | 618.3 | 826.2 |
| `backoff_8` | 1.545 | 628.7 | 830.1 |
| `spin_1000_backoff_8` | **1.519** | **657.5** | 838.6 |
| `spin_2000_backoff_8` | **1.503** | 634.9 | 832.8 |
**Best: `spin_1000_backoff_8`** — **3.7% lower latency**, **4.6% higher
throughput** vs default.
### DistilBERT (254 MB Transformer) — 4 intra-op threads
| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 30.468 | 31.4 | 329.3 |
| `no_spin` | 33.483 | 28.8 | 284.5 |
| `spin_1000` | 30.421 | 31.5 | 338.1 |
| `backoff_8` | 30.254 | 31.3 | 344.4 |
| **`spin_1000_backoff_8`** | **29.583** | **31.8** | 340.5 |
**Best: `spin_1000_backoff_8`** — **2.9% lower latency** vs default.
### DistilBERT — 8 intra-op threads
| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 23.194 | 41.4 | 672.1 |
| `no_spin` | 32.548 | 32.2 | 395.0 |
| `spin_1000` | 23.291 | 41.3 | 675.3 |
| **`backoff_8`** | **22.995** | **43.2** | 705.3 |
| `spin_1000_backoff_8` | 23.535 | 41.1 | 662.8 |
**Best: `backoff_8`** — **0.9% lower latency**, **4.3% higher
throughput** vs default.
### Summary
- **`spin_1000_backoff_8`** is the most consistent best performer across
models and thread counts.
- Benefits grow with thread count: from ~3% at 4T to **43–76%** at 16T.
- No throughput regressions observed in any backoff configuration vs its
non-backoff equivalent.
- Backoff configs use slightly less CPU than raw spinning while
achieving higher throughput — a win-win on power/efficiency.
## Testing
- **Backward compatibility:** Default `spin_backoff_max = 1` produces
identical spin behavior to `main`. Existing thread pool tests
(`SpinDurationDefault`, `SpinDurationZero_NoSpinning`,
`SpinDurationPositive_TimeBased`) continue to pass unmodified since the
default backoff is 1.
- **Benchmark script:** Use the new benchmark tool to compare settings
on a model:
```bash
python tools/perftest/benchmark_spin_settings.py \
--perf_test build/Release/onnxruntime_perf_test \
--model path/to/model.onnx \
--intra_op 4 --duration 10 --repeats 3 \
--configs default spin_1000 spin_1000_backoff_8
```
- **Build verification:** All modified translation units compile cleanly
under `-Wall -Wextra -Werror` in the existing cu128 Release build.