onnxruntime
e21b948d - [webgpu] Optimize string stream used in WebGPU EP (#27223)

Commit
124 days ago
[webgpu] Optimize string stream used in WebGPU EP (#27223)

### Description

Optimize the string stream used in WebGPU EP.

### Motivation and Context

The current implementation uses an `absl::OStringStream`, which is faster than `std::ostringstream`. However, it is still slow when generating the program cache key. Profiling data shows that `CalculateProgramCacheKey()` is extremely time-consuming: it can take up to 1/3 of all CPU time inside `WebGpuContext::Run()`:

<img width="1035" height="185" alt="image" src="https://github.com/user-attachments/assets/5b9e33cc-cd0a-4ef8-9a92-2ee894b85156" />

Basic analysis shows that most of the time is spent in the `std::basic_ostream` `operator<<()` implementation, which is far slower than expected.

To optimize this, the PR uses a simplified implementation, `FastOStringStream`, which does not inherit from `std::basic_ostream`. Instead, the class includes only the overloads required for generating the cache key and shader code, to reduce unnecessary overhead as much as possible.

<img width="1016" height="156" alt="image" src="https://github.com/user-attachments/assets/32e3d345-c56d-4e6d-89e1-99cc7b150d8e" />

As a result, the CPU sample count for `CalculateProgramCacheKey()` in the same test dropped from 2555 to 176. Generation TPS in the E2E model benchmark on Qwen3-0.6B increased from ~90 to ~130 on Windows11/13900k/RTX4070.
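For illustration only, the general idea of a stream-like class that does not derive from `std::basic_ostream` could be sketched as below. This is not the `FastOStringStream` from the PR: the class name `SketchStringStream`, the helper `MakeExampleKey`, and the key format it produces are all hypothetical, and the real class likely supports more types. The sketch assumes C++17 and formats integers with `std::to_chars`, appending directly to a `std::string`.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <system_error>
#include <type_traits>

// Hypothetical sketch, not the class added by the PR.
// It has no std::basic_ostream base, so there is no locale, sentry,
// or formatting-state handling on each insertion; every operator<<
// appends straight to the underlying std::string.
class SketchStringStream {
 public:
  SketchStringStream& operator<<(std::string_view s) {
    buffer_.append(s.data(), s.size());
    return *this;
  }

  SketchStringStream& operator<<(const char* s) {
    buffer_.append(s);
    return *this;
  }

  SketchStringStream& operator<<(char c) {
    buffer_.push_back(c);
    return *this;
  }

  // bool is rejected at compile time in this sketch so that it cannot
  // silently convert to char.
  SketchStringStream& operator<<(bool) = delete;

  // Integer types (other than char and bool) are formatted with
  // std::to_chars, which is locale-independent and does not allocate.
  template <typename T,
            std::enable_if_t<std::is_integral_v<T> &&
                                 !std::is_same_v<T, char> &&
                                 !std::is_same_v<T, bool>,
                             int> = 0>
  SketchStringStream& operator<<(T value) {
    char tmp[24];  // enough for any 64-bit integer plus a sign
    auto [ptr, ec] = std::to_chars(tmp, tmp + sizeof(tmp), value);
    if (ec == std::errc()) {
      buffer_.append(tmp, static_cast<std::size_t>(ptr - tmp));
    }
    return *this;
  }

  const std::string& str() const { return buffer_; }

 private:
  std::string buffer_;
};

// Hypothetical usage: the key layout here is invented for the example
// and does not reflect what CalculateProgramCacheKey() produces.
inline std::string MakeExampleKey(std::string_view program_name,
                                  std::uint32_t x, std::uint32_t y,
                                  std::uint32_t z) {
  SketchStringStream ss;
  ss << program_name << "[" << x << "," << y << "," << z << "]";
  return ss.str();
}
```

Because call sites keep the familiar `<<` chaining, a class shaped like this can stand in for an `std::ostream`-based stream wherever only a small set of insertable types is needed, which matches the trade-off the commit message describes.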