onnxruntime
90d92f82 - [webgpu] Support RotaryEmbedding in flash attention (#26297)

### Description

We enabled Rotary Embedding (ROE) support in the Flash Attention code and introduced a fused operator, `FusedQKRotaryEmbeddingProgram`, which combines the two ROE operations into a single kernel (a sketch of the fused computation follows below).

### Motivation and Context

Previously, Flash Attention for WebGPU did not support ROE, while the CUDA version did. As a result, different operators were used for the Phi-4 model. With this PR, both CUDA and WebGPU can share the same Phi-4 model implementation.

Fusing the ROE operations within the attention module helps reduce CPU-side overhead by:

1. Decreasing the number of MatMulNbits operations.
2. Reducing memcpyFromHost calls.

This optimization improves **token generation speed** on high-performance GPUs, **achieving over 5% speedup on an NVIDIA 5080 and 4% on an Apple M3 Max.** We believe the generation phase is CPU-bound on such high-end GPUs, so reducing CPU time leads to noticeable gains. On low-performance GPUs, however, the improvement is negligible.

<img width="1589" height="870" alt="image" src="https://github.com/user-attachments/assets/ae8a73b3-3e41-4ae6-9687-43058c2be140" />
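For context, here is a minimal NumPy sketch of the fusion idea: applying rotary position embedding to Q and K in a single pass instead of two separate kernel launches. It assumes the half-rotation RoPE variant and made-up tensor shapes; the names `rope_cache`, `apply_rope`, and `fused_qk_rope` are illustrative only and do not correspond to the WGSL kernel emitted by `FusedQKRotaryEmbeddingProgram`.

```python
import numpy as np

def rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute cos/sin tables for rotary position embedding."""
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)   # [seq_len, half]
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate the pairs (x[..., :half], x[..., half:]) by position-dependent angles."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def fused_qk_rope(q, k, cos, sin):
    """Apply RoPE to Q and K together -- the fusion idea behind the new kernel."""
    # Broadcast cos/sin over the head axis: [seq, 1, half]
    c, s = cos[:, None, :], sin[:, None, :]
    return apply_rope(q, c, s), apply_rope(k, c, s)

# Usage example with hypothetical shapes: [seq_len, num_heads, head_dim]
q = np.random.randn(8, 4, 64).astype(np.float32)
k = np.random.randn(8, 4, 64).astype(np.float32)
cos, sin = rope_cache(seq_len=8, head_dim=64)
q_rot, k_rot = fused_qk_rope(q, k, cos, sin)
print(q_rot.shape, k_rot.shape)  # (8, 4, 64) (8, 4, 64)
```

In the actual operator the same idea is expressed as one GPU dispatch, so Q and K no longer need separate RotaryEmbedding launches before the attention kernel.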