Implement mutex-free spin lock for task queue (#14834)
Implemented a "lock-free" (mutex-free) spin lock for the task queue to cut the CPU usage spent on context switching.
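For readers unfamiliar with the technique, here is a minimal sketch of a spin lock guarding a task queue. It is an illustration only, not the code added by this change: the names `SpinLock`, `TaskQueue`, and `RunDemo` are hypothetical, and the busy-wait loop is deliberately simple (a production version would typically issue a CPU pause hint while spinning).

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical test-and-test-and-set spin lock; never blocks in the kernel.
class SpinLock {
 public:
  void lock() noexcept {
    for (;;) {
      // exchange() returns the previous value: false means we took the lock.
      if (!locked_.exchange(true, std::memory_order_acquire)) return;
      // Spin on a plain load until the lock looks free, which limits
      // cache-line traffic compared with retrying the exchange directly.
      while (locked_.load(std::memory_order_relaxed)) {
      }
    }
  }
  void unlock() noexcept { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};

// Hypothetical task queue protected by the spin lock instead of a mutex.
class TaskQueue {
 public:
  void Push(std::function<void()> task) {
    std::lock_guard<SpinLock> guard(lock_);  // SpinLock is BasicLockable
    tasks_.push_back(std::move(task));
  }
  bool TryPop(std::function<void()>& task) {
    std::lock_guard<SpinLock> guard(lock_);
    if (tasks_.empty()) return false;
    task = std::move(tasks_.front());
    tasks_.pop_front();
    return true;
  }

 private:
  SpinLock lock_;
  std::deque<std::function<void()>> tasks_;
};

// Pushes `per_thread` tasks from each of `threads` producers, then drains
// and runs them on the calling thread. Returns the number of tasks executed.
std::size_t RunDemo(std::size_t threads, std::size_t per_thread) {
  TaskQueue queue;
  std::size_t executed = 0;
  std::vector<std::thread> producers;
  for (std::size_t t = 0; t < threads; ++t) {
    producers.emplace_back([&] {
      for (std::size_t i = 0; i < per_thread; ++i) {
        queue.Push([&executed] { ++executed; });
      }
    });
  }
  for (auto& p : producers) p.join();
  std::function<void()> task;
  while (queue.TryPop(task)) task();
  return executed;
}
```

Calling `RunDemo(4, 1000)` pushes 4000 tasks concurrently and runs them all once the producers have joined. Spinning pays off only when the critical section is short, as it is for a queue push or pop; for long waits, a blocking mutex is usually the better choice.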
The change has been tested on the queene service of the Ads team: with
40 threads, the lock-free version of ORT cuts CPU usage on gen8 Windows
machines (128 logical processors on 8 NUMA nodes) by nearly half, from
65% to 35%.
On 32 cores, latency is essentially unchanged:
Anubis, 32 vCPU, Windows, Hugging Face models,
95th-percentile E2E latency in ms:
model | mutex (ms) | mutex-free (ms)
--- | --- | ---
albert_base_v2 | 34.21 | 34.09
bert_large_uncased | 116.27 | 117.84
bart_base | 72.06 | 71.99
distilgpt2 | 25.43 | 25.02
vit_base_patch16_224 | 37.33 | 37.76
Anubis, 32 vCPU win, Linux, 1st party models,
95th-percentile E2E latency in ms:
model | mutex (ms) | mutex-free (ms)
--- | --- | ---
deepthink_v2 | 24.35 | 22.95
bing_feeds | 36.96 | 36.48
deep_writes | 14.46 | 14.32
keypoints | 9.34 | 7.69
model11 | 1.71 | 1.66
model12 | 1.82 | 1.44
model2 | 4.21 | 3.95
model6 | 1.08 | 1.05
agiencoder | 0.99 | 0.93
geminet_transformer | 5.32 | 5.24
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>