[webgpu] Optimize string stream used in WebGPU EP (#27223)
### Description
Optimize the string stream used in WebGPU EP.
### Motivation and Context
The current implementation uses an `absl::OStringStream`, which is faster
than `std::ostringstream`. However, it is still too slow for generating
the program cache key.
Profiling data shows that `CalculateProgramCacheKey()` is a hotspot. It
can consume up to 1/3 of all CPU time inside
`WebGpuContext::Run()`:
<img width="1035" height="185" alt="image"
src="https://github.com/user-attachments/assets/5b9e33cc-cd0a-4ef8-9a92-2ee894b85156"
/>
A basic analysis shows that most of the time is spent in the
`std::basic_ostream operator <<()` implementation, which is much slower
than expected.
To optimize this, the PR introduces a simplified implementation,
`FastOStringStream`, which does not inherit from `std::basic_ostream`.
Instead, the class implements only the operations required for
generating the cache key and shader code, which removes as much
unnecessary overhead as possible.
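
For illustration only, the general idea can be sketched as below. This is
not the `FastOStringStream` code from this PR: the class name
`SketchOStringStream`, the helper `BuildExampleKey`, the set of overloads,
and the key format are all hypothetical. The sketch assumes C++17 and only
shows how a stream-like class can append directly to a `std::string`
without going through `std::basic_ostream`.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical sketch, not the class added by this PR.
// It does not derive from std::basic_ostream, so there is no locale,
// sentry, or formatting-state handling on each insertion.
class SketchOStringStream {
 public:
  SketchOStringStream& operator<<(std::string_view s) {
    buffer_.append(s.data(), s.size());
    return *this;
  }

  SketchOStringStream& operator<<(const char* s) {
    buffer_.append(s);
    return *this;
  }

  SketchOStringStream& operator<<(char c) {
    buffer_.push_back(c);
    return *this;
  }

  // Integers (other than char and bool) are formatted with std::to_chars,
  // which is locale-independent and does not allocate.
  template <typename T,
            std::enable_if_t<std::is_integral_v<T> &&
                                 !std::is_same_v<T, char> &&
                                 !std::is_same_v<T, bool>,
                             int> = 0>
  SketchOStringStream& operator<<(T value) {
    char tmp[24];  // large enough for any 64-bit integer in base 10
    auto result = std::to_chars(tmp, tmp + sizeof(tmp), value);
    buffer_.append(tmp, static_cast<std::size_t>(result.ptr - tmp));
    return *this;
  }

  const std::string& str() const { return buffer_; }

 private:
  std::string buffer_;
};

// Hypothetical usage: the real cache key format is not shown here.
inline std::string BuildExampleKey(std::string_view name, int64_t rows,
                                   int64_t cols) {
  SketchOStringStream ss;
  ss << name << '[' << rows << ',' << cols << ']';
  return ss.str();
}
```

Because every overload appends straight to the underlying string, each
`<<` in this sketch costs roughly one `append` call, which is the kind of
overhead reduction described above.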
<img width="1016" height="156" alt="image"
src="https://github.com/user-attachments/assets/32e3d345-c56d-4e6d-89e1-99cc7b150d8e"
/>
As a result, the number of CPU samples attributed to
`CalculateProgramCacheKey()` in the same test dropped from 2555 to 176.
Generation TPS in the E2E model benchmark on Qwen3-0.6B increased from
~90 to ~130 on Windows 11/13900k/RTX4070.