onnxruntime
a49da870 - Enable QuickGeluFusion on WebGPU EP and fix fp16 shader (#28410)

### Description

Two related changes to enable a missing fusion path on the WebGPU EP:

1. **Register `QuickGeluFusion` for the WebGPU and JS EPs.** It was previously registered only for CPU/ACL/CUDA/DML, so SiLU patterns (`x * Sigmoid(x)`, used by SwiGLU MLPs in Qwen3.5, Llama, Phi, etc.) were left as separate `Sigmoid + Mul` ops on WebGPU. Fusing them into a single `QuickGelu` kernel saves one dispatch per MLP layer per token.
2. **Fix the `QuickGelu` WGSL shader for fp16 inputs.** The literals `1.0` and `0.0` were inferred as f32, which conflicts with `x_element_t = f16` under WGSL's strict typing rules and caused shader compile failures. Wrap them in `x_element_t(...)` casts (the same pattern `HardSigmoidImpl` uses). Without this fix, the newly enabled fusion would fail at pipeline creation on fp16 models.

### Motivation and Context

WebGPU was the only major EP missing this fusion. The SiLU pattern is extremely common in modern LLMs (every SwiGLU-based decoder layer hits it), so the dispatch reduction is meaningful for decode (token-generation) performance, where each saved dispatch matters relative to the small per-step compute.

### Measured impact

NVIDIA RTX 5080 (Blackwell), 6 trials, first cold run dropped, average of the remaining 5. Comparison is against the current `main` branch DLL (no fusion on WebGPU).

| Model | Prompt TPS (main → fused) | Decode TPS (main → fused) | Δ decode |
|---|---:|---:|---:|
| Qwen3.5-0.8B (int4) | 7563 → 7672 | 98.6 → 103.7 | **+5.2%** |
| Qwen3.5-4B (int4) | 2349 → 2366 | 55.7 → 56.3 | **+1.1%** |
| Phi-4-mini (int4) | 2924 → 2937 | 128.7 → 130.1 | **+1.1%** |

Pattern: the smaller the model, the larger the relative gain. Fusion saves a fixed per-step dispatch overhead, and that overhead is a larger fraction of the per-step time for smaller models. All three models showed neutral-to-positive prompt TPS; no regressions were observed.
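The fusion is numerically a no-op: `QuickGelu` with `alpha = 1` computes exactly the `x * Sigmoid(x)` pattern it replaces, so only the dispatch count changes. A minimal Python sketch of that equivalence (function names here are illustrative, not ORT APIs):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu_unfused(x: float) -> float:
    # Pre-fusion graph: a Sigmoid op followed by a Mul op.
    return x * sigmoid(x)

def quick_gelu(x: float, alpha: float) -> float:
    # QuickGelu computes x * sigmoid(alpha * x); with alpha = 1.0
    # this is exactly SiLU, so replacing Sigmoid + Mul is lossless.
    return x * sigmoid(alpha * x)

for x in (-2.5, -0.1, 0.0, 0.7, 3.0):
    assert math.isclose(silu_unfused(x), quick_gelu(x, alpha=1.0))
```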
Cross-vendor sanity check on Qwen3.5-0.8B (Intel iGPU, same DLL pair): decode 39.1 → 38.6 TPS, within trial-to-trial noise (~3%). No regression on Intel; the gain is concentrated where dispatch overhead matters most. Output text was verified to be coherent on every configuration before the numbers were recorded.
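The Δ-decode percentages can be recomputed from the raw TPS pairs reported above; a quick arithmetic check:

```python
def rel_delta(before: float, after: float) -> float:
    """Relative change in percent."""
    return (after / before - 1.0) * 100.0

# Decode TPS pairs from the RTX 5080 table.
assert round(rel_delta(98.6, 103.7), 1) == 5.2   # Qwen3.5-0.8B
assert round(rel_delta(55.7, 56.3), 1) == 1.1    # Qwen3.5-4B
assert round(rel_delta(128.7, 130.1), 1) == 1.1  # Phi-4-mini

# Intel iGPU sanity check: about -1.3%, inside the ~3% noise band.
assert abs(rel_delta(39.1, 38.6)) < 3.0
```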