onnxruntime
5bc10a39 - [webgpu] Optimize AttentionPrepare (#26850)

[webgpu] Optimize AttentionPrepare (#26850)

This pull request refactors and streamlines the computation of the Q, K, and V tensors in the WebGPU BERT Attention operator. The main changes are removing the custom QKV preparation kernel in favor of a more modular approach, a MatMul operation followed by a dedicated split kernel, and generalizing the QKV splitting logic for broader reuse. This improves maintainability, code reuse, and performance, since many optimizations have already been applied to the MatMul op. With this change, PrepareQKV drops from 751.67 ms to 128.88 ms in the phi4-vision model.

Before:

| Kernel | Time (ms) | Percentage (%) |
| -- | -- | -- |
| Attention\|AttentionPrepare | 751.67 | 49.91 |

After:

| Kernel | Time (ms) | Percentage (%) |
| -- | -- | -- |
| Attention\|MatMul | 120.87 | 19.77 |
| Attention\|SplitPackedQKV | 1.94 | 0.32 |
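The two-step approach the commit describes can be sketched in NumPy. This is a simplified illustration, not the actual WebGPU kernel code: all shapes and weight names (`w_qkv`, `hidden`, etc.) are hypothetical, and the sketch treats the packed QKV layout as a plain concatenation along the last axis, ignoring per-head interleaving details the real operator may handle.

```python
import numpy as np

# Hypothetical shapes for illustration only.
batch, seq_len, hidden = 2, 4, 8

rng = np.random.default_rng(0)
x = rng.standard_normal((batch, seq_len, hidden)).astype(np.float32)
# Packed projection weight producing Q, K, V in a single GEMM: [hidden, 3*hidden]
w_qkv = rng.standard_normal((hidden, 3 * hidden)).astype(np.float32)

# Step 1: one MatMul for the packed projection
# (this is the part that benefits from the existing MatMul optimizations).
packed = x @ w_qkv                      # [batch, seq, 3*hidden]

# Step 2: a cheap split of the packed result into Q, K, V
# (the role played by the SplitPackedQKV kernel).
q, k, v = np.split(packed, 3, axis=-1)  # each [batch, seq, hidden]

# Splitting after one fused GEMM is equivalent to three separate projections
# with the corresponding weight slices.
assert np.allclose(q, x @ w_qkv[:, :hidden], atol=1e-5)
assert np.allclose(v, x @ w_qkv[:, 2 * hidden:], atol=1e-5)
```

The design point is that the expensive work lands in one large, already well-optimized MatMul, while the split is a lightweight copy, which matches the profile above where SplitPackedQKV accounts for only 1.94 ms.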