[webgpu] Optimize AttentionPrepare (#26850)
This pull request refactors and streamlines the computation of Q, K, V
tensors in the WebGPU BERT Attention operator. The main changes include
removing a custom QKV preparation kernel in favor of a more modular
approach using a MatMul operation followed by a dedicated split kernel,
and generalizing the QKV splitting logic for broader reuse. This
improves maintainability and code reuse, and it also improves performance,
since the MatMul op has already received many optimizations.
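For illustration, here is a minimal CPU-side sketch of what the new split step does, assuming the MatMul output is packed as `[batch, seq_len, 3 * hidden_size]` with Q, K, and V concatenated along the last axis; the actual SplitPackedQKV shader and its exact layout may differ, and the function and parameter names below are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the split step: the packed MatMul output is
// assumed to hold Q, K, V concatenated along the last dimension, and each is
// copied out into its own [batch, seq_len, hidden_size] buffer. The WebGPU
// kernel would perform the same addressing, one element per invocation.
void SplitPackedQKV(const std::vector<float>& packed_qkv,
                    size_t batch, size_t seq_len, size_t hidden_size,
                    std::vector<float>& q,
                    std::vector<float>& k,
                    std::vector<float>& v) {
  const size_t packed_row = 3 * hidden_size;
  q.resize(batch * seq_len * hidden_size);
  k.resize(batch * seq_len * hidden_size);
  v.resize(batch * seq_len * hidden_size);
  for (size_t b = 0; b < batch; ++b) {
    for (size_t s = 0; s < seq_len; ++s) {
      const size_t src = (b * seq_len + s) * packed_row;    // row in packed QKV
      const size_t dst = (b * seq_len + s) * hidden_size;   // row in Q, K, or V
      for (size_t h = 0; h < hidden_size; ++h) {
        q[dst + h] = packed_qkv[src + h];                    // first third  -> Q
        k[dst + h] = packed_qkv[src + hidden_size + h];      // second third -> K
        v[dst + h] = packed_qkv[src + 2 * hidden_size + h];  // last third   -> V
      }
    }
  }
}
```

On the GPU this split is a trivially parallel copy, which is why it accounts for only a small fraction of the attention time in the tables below.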
With this change, QKV preparation drops from 751.67 ms to 128.88 ms in the
phi4-vision model.
Before

Kernel | Time (ms) | Percentage (%)
-- | -- | --
Attention\|AttentionPrepare | 751.67 | 49.91
After

Kernel | Time (ms) | Percentage (%)
-- | -- | --
Attention\|MatMul | 120.87 | 19.77
Attention\|SplitPackedQKV | 1.94 | 0.32