Enable QuickGeluFusion on WebGPU EP and fix fp16 shader (#28410)
### Description
Two related changes to enable a missing fusion path on the WebGPU EP:
1. **Register `QuickGeluFusion` for the WebGPU and JS EPs.** It was
   previously registered only for CPU/ACL/CUDA/DML, so SiLU patterns
   (`x * Sigmoid(x)`, used by SwiGLU MLPs in Qwen3.5, Llama, Phi, etc.)
   were left as separate `Sigmoid + Mul` ops on WebGPU. Fusing them into
   a single `QuickGelu` kernel saves one dispatch per MLP layer per token.
2. **Fix the `QuickGelu` WGSL shader for fp16 inputs.** The literals
   `1.0` and `0.0` were inferred as f32, which conflicts with
   `x_element_t = f16` under WGSL's strict typing and caused shader
   compile failures. Wrap them in `x_element_t(...)` casts (the same
   pattern as `HardSigmoidImpl`). Without this fix, the newly enabled
   fusion would fail at pipeline creation on fp16 models.
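For reference, the fused and unfused paths compute the same function. A minimal numerical sketch (assuming the fusion emits `QuickGelu` with its `alpha` attribute set to `1.0`, which makes it exactly the SiLU pattern; function names here are illustrative, not ORT source):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu_unfused(x: float) -> float:
    # Unfused graph: two ops, two dispatches (Sigmoid, then Mul).
    return x * sigmoid(x)

def quick_gelu(x: float, alpha: float = 1.0) -> float:
    # Fused graph: one op, one dispatch, computing x * Sigmoid(alpha * x).
    # alpha = 1.0 is assumed here so it matches the SiLU pattern;
    # QuickGelu as a GELU approximation typically uses alpha = 1.702.
    return x * sigmoid(alpha * x)

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(silu_unfused(x) - quick_gelu(x)) < 1e-12
```

The fusion is purely a dispatch-count optimization: the math is unchanged, so outputs should be bit-comparable up to normal floating-point evaluation-order effects.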
### Motivation and Context
WebGPU was the only major EP missing this fusion. The SiLU pattern is
extremely common in modern LLMs (every SwiGLU-based decoder layer hits
it), so the dispatch reduction is meaningful for decode (token-generation)
performance, where each saved dispatch matters relative to the small
per-step compute.
### Measured impact
NVIDIA RTX 5080 (Blackwell); 6 trials per configuration, first (cold) run
dropped, average of the remaining 5. Comparison is against the current
`main` branch DLL (no fusion on WebGPU).

| Model | Prompt TPS (main → fused) | Decode TPS (main → fused) | Δ decode |
|---|---:|---:|---:|
| Qwen3.5-0.8B (int4) | 7563 → 7672 | 98.6 → 103.7 | **+5.2%** |
| Qwen3.5-4B (int4) | 2349 → 2366 | 55.7 → 56.3 | **+1.1%** |
| Phi-4-mini (int4) | 2924 → 2937 | 128.7 → 130.1 | **+1.1%** |
Pattern: the smaller the model, the larger the relative gain. Fusion saves
a fixed per-step dispatch overhead, and that overhead is a larger fraction
of the per-step time for smaller models. All three models showed
neutral-to-positive prompt TPS; no regressions were observed.
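The Δ decode column is the relative change in decode TPS; recomputing it from the table values:

```python
def delta_pct(before_tps: float, after_tps: float) -> float:
    # Relative throughput change, in percent.
    return (after_tps / before_tps - 1.0) * 100.0

# Decode TPS pairs (main -> fused) from the table above.
rows = {
    "Qwen3.5-0.8B": (98.6, 103.7),
    "Qwen3.5-4B": (55.7, 56.3),
    "Phi-4-mini": (128.7, 130.1),
}
for name, (before, after) in rows.items():
    print(f"{name}: {delta_pct(before, after):+.1f}%")
# Qwen3.5-0.8B: +5.2%, Qwen3.5-4B: +1.1%, Phi-4-mini: +1.1%
```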
Cross-vendor sanity check on Qwen3.5-0.8B (Intel iGPU, same DLL pair):
decode went 39.1 → 38.6 TPS, within trial-to-trial noise (~3%). No
regression on Intel; the gain is concentrated where dispatch overhead
matters most. Output text was verified to be coherent in every
configuration before recording numbers.