onnxruntime
3f74b3cf - Update worker thread pool to use time based wait. (#27916)

Update worker thread pool to use time based wait. (#27916)

# Make thread pool spin duration configurable via session option

## Problem

The ORT Eigen thread pool's `SpinPause` loop uses a fixed iteration count (`1 << 20` ≈ 1M iterations) before blocking. The actual wall-clock spin duration varies dramatically by CPU architecture:

| Pause Instruction | Architecture | Spin Duration (1M iterations) |
|---|---|---|
| `_mm_pause` | Pre-Skylake | ~3ms |
| `_mm_pause` | Skylake+ @ 3 GHz | ~47ms |
| `_tpause` | 3 GHz base | ~333ms |
| `_tpause` | 2 GHz base | ~500ms |

For client/on-device workloads (e.g., Whisper in Edge), this causes high CPU utilization visible in profilers and Task Manager, even though the CPU is in a low-power spin state.

At 3 GHz, 1M iterations works out to:

- **Pre-Skylake:** 1M × 10 cycles / 3 GHz ≈ **3.3ms**
- **Skylake @ 3 GHz:** 1M × 140 cycles / 3 GHz ≈ **47ms**
- **Skylake @ 5 GHz (turbo):** 1M × 140 cycles / 5 GHz ≈ **28ms**
- **AMD Zen @ 4 GHz:** 1M × 65 cycles / 4 GHz ≈ **16ms**

The total duration scales inversely with clock speed and varies dramatically across microarchitectures. The 14x increase on Skylake came from Intel lengthening `_mm_pause` after finding that the short pause wasted power and caused memory bus contention in spin loops.

### `_tpause`

`_tpause(0x0, __rdtsc() + 1000)` waits for a fixed number of TSC ticks. TSC frequency is typically fixed at the processor's base frequency (not turbo), so:

- **3 GHz base:** 1000 ticks ≈ 333ns per iteration → 1M iterations ≈ **333ms**
- **2 GHz base:** 1000 ticks ≈ 500ns per iteration → 1M iterations ≈ **500ms**

The per-iteration time is more predictable than `_mm_pause` (TSC is constant-rate on modern CPUs), but it still scales with TSC frequency. The total spin is much longer because each iteration takes ~333ns vs. ~28–47ns for `_mm_pause` on Skylake+.

### Profiler visibility

Both `_tpause` and `_mm_pause` are counted as **CPU busy** time in Task Manager and ETW sampling profilers, even though they are low-power CPU states.
This ends up looking like Edge consuming all the CPU during speech recognition.

## Solution

This PR makes the thread pool spin behavior configurable while **preserving the default (original) behavior** for backward compatibility:

- **Default (`-1`)**: Uses the original iteration-count-based spin loop (1M iterations). Unchanged throughput characteristics.
- **`0`**: Disables spinning entirely (threads block immediately).
- **`> 0`**: Enables time-based spinning for the specified duration in microseconds using `std::chrono::steady_clock`. Recommended for power-sensitive workloads.

### Session option usage

```cpp
// Use time-based spinning with 1ms duration (recommended for on-device/client workloads)
session_options.AddConfigEntry("session.intra_op.spin_duration_us", "1000");

// Disable spinning entirely
session_options.AddConfigEntry("session.intra_op.spin_duration_us", "0");
```

Both intra-op and inter-op thread pools are independently configurable via `session.intra_op.spin_duration_us` and `session.inter_op.spin_duration_us`.
## Changes

### Core thread pool (EigenNonBlockingThreadPool.h)

- `WorkerLoop` now has two spin paths selected by `spin_duration_us_`:
  - Negative (default): original iteration-count loop, identical to `main`
  - Positive: time-based spin using `steady_clock`, with power-of-2 bitmask optimizations for the steal interval and the clock-read frequency
- Constructor parameter changed from `bool allow_spinning` → `int spin_duration_us`
- `ComputeTimeCheckMask()`: dynamically computes the clock-read frequency from the spin duration (clamped to [128, 4096] iterations) to keep overhead under 1%

### Configuration plumbing

- New session config keys: `session.intra_op.spin_duration_us`, `session.inter_op.spin_duration_us`
- `OrtThreadPoolParams.spin_duration_us` field with sentinel default `-1`
- `ParseSpinDurationUs()` helper using `TryParseStringWithClassicLocale` for safe parsing
- `allow_spinning` and `spin_duration_us` are merged in `CreateThreadPoolHelper`: when `allow_spinning=false`, the spin duration is forced to `0`

### Test updates

- All 8 internal call sites passing `bool true` updated to `concurrency::kSpinDurationDefault`, avoiding a silent implicit bool-to-int conversion
- `onnxruntime_perf_test` supports a `--spin_duration_us` CLI flag
- Thread pool benchmarks use `kSpinDurationDefault`

## Key design decisions

1. **Default preserves original behavior**: No performance regression for existing users. Benchmarks confirmed the iteration-count path matches `main`.
2. **`steady_clock` over `high_resolution_clock`**: The monotonic guarantee prevents spin-deadline issues from clock jumps.
3. **`unsigned int` loop counter**: Prevents signed overflow in the unbounded time-based spin loop.
4. **Power-of-2 bitmask optimization**: Steal every 128 iterations (`& 0x7F`); clock checks happen at a separate frequency computed from the spin duration, avoiding modulo operations in the hot loop.
# Results

<img width="3838" height="1478" alt="image" src="https://github.com/user-attachments/assets/265a0af0-4ed7-46ae-8263-96553bb592b2" />

LHS shows the problem: 85% of CPU time is spent in SpinWait. RHS shows the same trace with the fix: 50% lower CPU utilization, and the length of the usage spikes drops from 527ms to 130ms.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>