onnxruntime
5bc10a39 - [webgpu] Optimize AttentionPrepare (#26850)

[webgpu] Optimize AttentionPrepare (#26850)

This pull request refactors and streamlines the computation of the Q, K, and V tensors in the WebGPU BERT Attention operator. The main changes are removing the custom QKV preparation kernel in favor of a more modular approach, a MatMul operation followed by a dedicated split kernel, and generalizing the QKV splitting logic for broader reuse. This improves maintainability, code reuse, and performance, since many optimizations have already been applied to the MatMul op. With this change, PrepareQKV drops from 751.67 ms to 128.88 ms in the phi4-vision model.

Before:

| Kernel | Time (ms) | Percentage (%) |
| -- | -- | -- |
| Attention\|AttentionPrepare | 751.67 | 49.91 |

After:

| Kernel | Time (ms) | Percentage (%) |
| -- | -- | -- |
| Attention\|MatMul | 120.87 | 19.77 |
| Attention\|SplitPackedQKV | 1.94 | 0.32 |
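The two-step approach the commit describes can be sketched in NumPy. This is a simplified illustration, not the actual WebGPU kernel code: all shapes and weight names (`w_qkv`, `hidden`, etc.) are hypothetical, and the sketch treats the packed QKV layout as a plain concatenation along the last axis, ignoring per-head interleaving details the real operator may handle.

```python
import numpy as np

# Hypothetical shapes for illustration only.
batch, seq_len, hidden = 2, 4, 8

rng = np.random.default_rng(0)
x = rng.standard_normal((batch, seq_len, hidden)).astype(np.float32)
# Packed projection weight producing Q, K, V in a single GEMM: [hidden, 3*hidden]
w_qkv = rng.standard_normal((hidden, 3 * hidden)).astype(np.float32)

# Step 1: one MatMul for the packed projection
# (this is the part that benefits from the existing MatMul optimizations).
packed = x @ w_qkv                      # [batch, seq, 3*hidden]

# Step 2: a cheap split of the packed result into Q, K, V
# (the role played by the SplitPackedQKV kernel).
q, k, v = np.split(packed, 3, axis=-1)  # each [batch, seq, hidden]

# Splitting after one fused GEMM is equivalent to three separate projections
# with the corresponding weight slices.
assert np.allclose(q, x @ w_qkv[:, :hidden], atol=1e-5)
assert np.allclose(v, x @ w_qkv[:, 2 * hidden:], atol=1e-5)
```

The design point is that the expensive work lands in one large, already well-optimized MatMul, while the split is a lightweight copy, which matches the profile above where SplitPackedQKV accounts for only 1.94 ms.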