onnxruntime
2d5316f1 - [webgpu] Use workgroup memory to reduce register pressure (#24286)

Commit
255 days ago
On Qualcomm Adreno X1 GPUs, the previous implementation of the FlashAttentionProgram shader in the WebGPU backend caused high register pressure, which degraded performance. This PR uses workgroup memory to reduce the register pressure and improve performance. Time to first token (TTFT) for phi4 with 1K inputs drops from 40s to 10s on a Qualcomm Adreno X1 GPU.
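To illustrate the general idea of trading per-invocation registers for workgroup memory, here is a minimal sketch. It is not the FlashAttentionProgram shader from this PR. The function name buildTiledDotShader, the buffer names q, k and scores, and the tile layout are all hypothetical, and the sketch assumes the buffers hold a whole number of tiles (no bounds checks). It builds a WGSL compute shader in which a tile of K rows is staged once per workgroup in a var<workgroup> array, instead of each invocation keeping its own private copy.

```typescript
// Hypothetical sketch, NOT the shader from the PR. It shows the general
// pattern only: stage a tile in workgroup memory so that each invocation
// does not need a private (register-backed) array of the same data.
function buildTiledDotShader(tileSize: number, headSize: number): string {
  if (!Number.isInteger(tileSize) || tileSize <= 0 ||
      !Number.isInteger(headSize) || headSize <= 0) {
    throw new Error("tileSize and headSize must be positive integers");
  }
  return `
const TILE_SIZE: u32 = ${tileSize}u;
const HEAD_SIZE: u32 = ${headSize}u;

@group(0) @binding(0) var<storage, read> q: array<f32>;
@group(0) @binding(1) var<storage, read> k: array<f32>;
@group(0) @binding(2) var<storage, read_write> scores: array<f32>;

// Shared by every invocation in the workgroup.
var<workgroup> k_tile: array<f32, ${tileSize * headSize}>;

@compute @workgroup_size(${tileSize})
fn main(@builtin(local_invocation_index) lid: u32,
        @builtin(workgroup_id) wid: vec3<u32>) {
  // Each invocation copies one K row into the shared tile.
  let row = wid.x * TILE_SIZE + lid;
  for (var d = 0u; d < HEAD_SIZE; d++) {
    k_tile[lid * HEAD_SIZE + d] = k[row * HEAD_SIZE + d];
  }
  // Make the staged tile visible to the whole workgroup before reading it.
  workgroupBarrier();

  // Each invocation owns one Q row and dots it with every staged K row.
  for (var j = 0u; j < TILE_SIZE; j++) {
    var acc = 0.0;
    for (var d = 0u; d < HEAD_SIZE; d++) {
      acc += q[row * HEAD_SIZE + d] * k_tile[j * HEAD_SIZE + d];
    }
    scores[row * TILE_SIZE + j] = acc;
  }
}
`;
}
```

Calling buildTiledDotShader(8, 64) returns WGSL source that could be passed to GPUDevice.createShaderModule. Whether this layout actually lowers register pressure depends on the GPU and driver compiler, so any gain would need to be measured on the target device.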