5743f714 - Add exponential-backoff option for thread pool spin loop (#28096)

# Add exponential-backoff option for thread pool spin loop

## Description

This PR adds an opt-in exponential-backoff mode to the thread pool's idle spin loop, complementing the configurable `spin_duration_us` introduced in #27916. When enabled, each spin iteration emits a geometrically increasing number of `SpinPause()` calls (1, 2, 4, …, capped at `spin_backoff_max`), which reduces pause-instruction density and lowers CPU/power usage during the spin window, particularly on hybrid (P/E-core) and mobile platforms. The iteration count is scaled automatically so that the wall-clock spin budget targeted by `spin_duration_us` is preserved.

The idea is adapted from #23278 (https://github.com/microsoft/onnxruntime/pull/21545 and https://github.com/microsoft/onnxruntime/pull/22315), which showed measurable power and latency improvements on Intel Meteor Lake by reducing busy-wait density. This PR makes the technique opt-in and composable with the time-bounded spin knob from #27916, so users can independently control *how long* to spin and *how densely* to spin.
## Summary of Changes

### Core thread pool (`EigenNonBlockingThreadPool.h`)

| File | Change |
|------|--------|
| `include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h` | Add `ThreadPoolWaiter` inner class implementing exponential backoff; add `NormalizeBackoff()` and `ScaleSpinCountForBackoff()` helpers; replace bare `SpinPause()` in `WorkerLoop` with `waiter.wait()`; store a `spin_backoff_max_` member and accept it in the constructor |

### Configuration plumbing

| File | Change |
|------|--------|
| `include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h` | New config keys `session.intra_op.spin_backoff_max` and `session.inter_op.spin_backoff_max` |
| `include/onnxruntime/core/platform/threadpool.h` | Add a `spin_backoff_max` parameter to the `ThreadPool` constructor (default `1`, backward-compatible) |
| `onnxruntime/core/common/threadpool.cc` | Forward `spin_backoff_max` to `ThreadPoolTempl` |
| `onnxruntime/core/util/thread_utils.h` | Add a `spin_backoff_max` field to `OrtThreadPoolParams` |
| `onnxruntime/core/util/thread_utils.cc` | Pass `spin_backoff_max` into `ThreadPool`; log it in `operator<<(OrtThreadPoolParams)` |
| `onnxruntime/core/session/inference_session.cc` | Add a `ParseSpinBackoffMax()` helper; parse and apply both intra-op and inter-op config keys |

### Perf test CLI & benchmark script

| File | Change |
|------|--------|
| `onnxruntime/test/perftest/command_args_parser.cc` | New `--spin_backoff_max` flag |
| `onnxruntime/test/perftest/ort_test_session.cc` | Apply the flag to session options |
| `onnxruntime/test/perftest/test_configuration.h` | New `spin_backoff_max` field in `RunConfig` |
| `tools/perftest/benchmark_spin_settings.py` | New benchmark script that runs `onnxruntime_perf_test` across a matrix of spin settings (duration × backoff) and reports latency, throughput, and CPU% |

## Key Design Decisions

1. **Default preserves existing behavior.** `spin_backoff_max = 1` means one `SpinPause()` per iteration, identical to today.
No performance change for users who don't opt in.
2. **Wall-clock budget preservation.** When backoff is enabled, the iteration count is divided by `spin_backoff_max`, so the total number of `SpinPause()` calls, and therefore the approximate spin duration, stays the same as on the non-backoff path.
3. **Composable with `spin_duration_us`.** Backoff and time-bounded spinning are orthogonal knobs. Users can apply either independently or combine them (e.g., `spin_duration_us=1000` + `spin_backoff_max=8`).
4. **Subordinate to `allow_spinning`.** When spinning is disabled, `spin_backoff_max` is ignored, just like `spin_duration_us`.

## Session option usage

```cpp
// Enable exponential backoff with cap 8, combined with 1 ms time-bounded spinning
session_options.AddConfigEntry("session.intra_op.spin_duration_us", "1000");
session_options.AddConfigEntry("session.intra_op.spin_backoff_max", "8");
```

## Benchmark Results

Benchmarks were run on an Intel i9-13900KF (6 P-cores / 12 threads under WSL2), 32 GB RAM, Release build with the CPU EP, using `tools/perftest/benchmark_spin_settings.py`. Each configuration was repeated 3–5 times (median latency, mean throughput/CPU reported), with a duration of 10 seconds per run.
### SqueezeNet (5 MB CNN) — 16 intra-op threads, 5 repeats

High thread count amplifies spin contention, making this the most illustrative test:

| Config | `spin_duration_us` | `spin_backoff_max` | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|---|---|:-:|:-:|:-:|
| `default` | (legacy) | (legacy) | 3.243 | 303.0 | 1245.8 |
| `no_spin` | — | — | 5.489 | 176.3 | 332.7 |
| `spin_1000` | 1000 | — | 1.870 | 514.5 | 1214.6 |
| `spin_2000` | 2000 | — | 2.040 | 478.2 | 1219.9 |
| `backoff_8` | (legacy) | 8 | 3.268 | 303.4 | 1257.4 |
| `spin_1000_backoff_4` | 1000 | 4 | 1.849 | 513.8 | 1221.4 |
| **`spin_1000_backoff_8`** | **1000** | **8** | **1.835** | **534.5** | **1221.1** |
| `spin_2000_backoff_8` | 2000 | 8 | 2.050 | 470.3 | 1223.2 |

**Best: `spin_1000_backoff_8`** — **43% lower latency**, **76% higher throughput** vs default, while using **2% less CPU**.

### SqueezeNet — 8 intra-op threads

| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 1.578 | 628.7 | 826.5 |
| `no_spin` | 3.742 | 261.1 | 322.9 |
| `spin_1000` | 1.547 | 618.3 | 826.2 |
| `backoff_8` | 1.545 | 628.7 | 830.1 |
| `spin_1000_backoff_8` | **1.519** | **657.5** | 838.6 |
| `spin_2000_backoff_8` | **1.503** | 634.9 | 832.8 |

**Best: `spin_1000_backoff_8`** — **3.7% lower latency**, **4.6% higher throughput** vs default.

### DistilBERT (254 MB Transformer) — 4 intra-op threads

| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 30.468 | 31.4 | 329.3 |
| `no_spin` | 33.483 | 28.8 | 284.5 |
| `spin_1000` | 30.421 | 31.5 | 338.1 |
| `backoff_8` | 30.254 | 31.3 | 344.4 |
| **`spin_1000_backoff_8`** | **29.583** | **31.8** | 340.5 |

**Best: `spin_1000_backoff_8`** — **2.9% lower latency** vs default.
### DistilBERT — 8 intra-op threads

| Config | Avg Latency (ms) | Throughput (IPS) | CPU % |
|--------|:-:|:-:|:-:|
| `default` | 23.194 | 41.4 | 672.1 |
| `no_spin` | 32.548 | 32.2 | 395.0 |
| `spin_1000` | 23.291 | 41.3 | 675.3 |
| **`backoff_8`** | **22.995** | **43.2** | 705.3 |
| `spin_1000_backoff_8` | 23.535 | 41.1 | 662.8 |

**Best: `backoff_8`** — **0.9% lower latency**, **4.3% higher throughput** vs default.

### Summary

- **`spin_1000_backoff_8`** is the most consistent best performer across models and thread counts.
- Benefits grow with thread count: from ~3% at 4 threads to **43–76%** at 16 threads.
- No significant throughput regressions were observed in any backoff configuration vs its non-backoff equivalent.
- Backoff configs use slightly less CPU than raw spinning while achieving higher throughput — a win on both power and efficiency.

## Testing

- **Backward compatibility:** The default `spin_backoff_max = 1` produces spin behavior identical to `main`. The existing thread pool tests (`SpinDurationDefault`, `SpinDurationZero_NoSpinning`, `SpinDurationPositive_TimeBased`) pass unmodified since the default backoff is 1.
- **Benchmark script:** Use the new benchmark tool to compare settings on a model:

  ```bash
  python tools/perftest/benchmark_spin_settings.py \
    --perf_test build/Release/onnxruntime_perf_test \
    --model path/to/model.onnx \
    --intra_op 4 --duration 10 --repeats 3 \
    --configs default spin_1000 spin_1000_backoff_8
  ```

- **Build verification:** All modified translation units compile cleanly under `-Wall -Wextra -Werror` in the existing cu128 Release build.