onnxruntime
7e3174b0 - [webgpu] Optimize dp4 prefill shader for Qualcomm (#25578)

Commit
194 days ago
[webgpu] Optimize dp4 prefill shader for Qualcomm (#25578)

This change uses subgroupShuffle to perform the matmul when the subgroup size is 64 (sg_size=64). It also replaces the unrolled loop with a regular loop to reduce register pressure. With these changes, Phi4 prefill for 1K tokens drops from 11.32s to 8.8s on the Qualcomm Adreno X1-85 GPU, a reduction of about 22%.
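The shuffle-based matmul described above can be illustrated with a small CPU-side simulation. This is only a sketch of the general idea, not the shader from this commit: the tile layout (each lane holding one row of A and one column of B) and the helper names subgroup_shuffle, dot, and shuffle_matmul_tile are assumptions made for illustration. The outer loop is kept as a regular loop, which on a GPU is what trades unrolled code for lower register pressure.

```python
SG_SIZE = 64  # subgroup size named in the commit message


def subgroup_shuffle(lane_values, src_lane):
    """Simulate subgroupShuffle: every lane reads the value held by src_lane."""
    return lane_values[src_lane]


def dot(u, v):
    """Plain dot product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(u, v))


def shuffle_matmul_tile(a_rows, b_cols):
    """Compute one SG_SIZE x SG_SIZE output tile.

    Assumed (hypothetical) layout: lane i keeps row i of the A tile and
    column i of the B tile in its own registers. No shared memory is used;
    rows of A are passed between lanes with a shuffle instead.
    """
    assert len(a_rows) == SG_SIZE and len(b_cols) == SG_SIZE
    out = [[0] * SG_SIZE for _ in range(SG_SIZE)]
    # Regular (rolled) loop over source lanes rather than 64 unrolled steps.
    for src in range(SG_SIZE):
        # All lanes receive the A row owned by lane `src`.
        a_row = subgroup_shuffle(a_rows, src)
        # On a GPU the lanes run in lockstep; here they are iterated serially.
        for lane in range(SG_SIZE):
            out[src][lane] = dot(a_row, b_cols[lane])
    return out
```

Under these assumptions, lane j ends up holding column j of the output tile, and the result matches an ordinary row-by-column matrix product.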