onnxruntime
a61fb39e - [webgpu] Fix poor performance in flash attention for Qualcomm devices (#25730)

Commit
161 days ago
On Qualcomm devices, performance appears to be poor when multiple threads in one subgroup access the same shared memory location (possibly bank conflicts). Limiting the number of threads that access the same memory location greatly improves performance on these devices: Phi4 goes from ~13s to ~10s on QC Adreno X1-85 (31.0.112.0).
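The access-pattern idea in the message can be illustrated with a toy model. This is only a sketch, not the shader code from the change: the subgroup size of 64, the limit of 4 threads per slot, and the helper names `max_sharers` and `grouped_addresses` are all hypothetical, chosen for illustration. It counts how many lanes of one subgroup read the same shared-memory slot under two indexing schemes.

```python
from collections import Counter

# Assumption: a subgroup of 64 lanes; real subgroup sizes vary by device.
SUBGROUP_SIZE = 64


def max_sharers(addresses):
    """Return the largest number of lanes that read any single address."""
    return max(Counter(addresses).values())


def grouped_addresses(subgroup_size, max_threads_per_slot):
    """Assign each lane a slot so that at most `max_threads_per_slot`
    lanes share one slot (hypothetical layout, for illustration only)."""
    return [lane // max_threads_per_slot for lane in range(subgroup_size)]


# Pattern A: every lane in the subgroup reads the same shared-memory slot.
broadcast = [0 for _ in range(SUBGROUP_SIZE)]

# Pattern B: lanes are split into groups of 4, each group reading its own slot.
limited = grouped_addresses(SUBGROUP_SIZE, 4)

print("lanes per slot, broadcast:", max_sharers(broadcast))
print("lanes per slot, limited:  ", max_sharers(limited))
print("distinct slots, limited:  ", len(set(limited)))
```

In this model the second pattern trades extra shared-memory slots for fewer lanes contending on each one. Whether that trade helps on a given GPU is an empirical question; the commit reports that it did on the Qualcomm device tested.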