onnxruntime
2bd00ec4 - [webgpu] Optimize FlashAttention for prefill (#25395)

Commit
139 days ago
### Description

This PR enhances unidirectional `FlashAttention` by applying causal masking inside the main loop. This optimization eliminates unnecessary memory loads by avoiding future entries in the KV cache.

Testing on Lunar Lake shows up to a 20% performance improvement for `phi-4-mini-accuracy4` (with a prompt of 4096). Similar performance gains were also observed for other models, including `Qwen3-0.6B-accuracy4`.

This PR now uses the more readable `unidirectional` attribute instead of `is_gqa` to control causal masking.

### Motivation and Context

See above.
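
To illustrate the idea in the description, here is a minimal pure-Python sketch of causal masking applied inside the KV loop. It is not the WGSL shader from this PR: the function names, the `tile` parameter, and the `tiles_loaded` counter are hypothetical, and it assumes a prefill with no past KV entries, so query `i` attends to keys `0..i` only. The tiled version stops its loop at the causal boundary instead of loading future tiles and masking them afterwards.

```python
import math


def naive_causal_attention(q, k, v):
    """Reference: per-row softmax over keys 0..i (future keys excluded)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(qi, k[j]))
                  for j in range(i + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(w[j] * v[j][d] for j in range(i + 1)) / total
                    for d in range(len(v[0]))])
    return out


def tiled_causal_attention(q, k, v, tile=4):
    """Online-softmax attention whose KV loop ends at the causal boundary.

    Returns the output rows and the number of KV tiles visited
    (a stand-in for memory loads; hypothetical, for illustration only).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dim = len(v[0])
    out, tiles_loaded = [], 0
    for i, qi in enumerate(q):
        m, denom, acc = -math.inf, 0.0, [0.0] * dim
        # Visit only tiles containing at least one key position <= i.
        for start in range(0, i + 1, tile):
            tiles_loaded += 1
            # Within the diagonal tile, stop at position i (the causal mask).
            end = min(start + tile, i + 1)
            for j in range(start, end):
                s = scale * sum(a * b for a, b in zip(qi, k[j]))
                new_m = max(m, s)
                rescale = math.exp(m - new_m)  # 0.0 on the first key
                p = math.exp(s - new_m)
                denom = denom * rescale + p
                acc = [a * rescale + p * x for a, x in zip(acc, v[j])]
                m = new_m
        out.append([a / denom for a in acc])
    return out, tiles_loaded
```

With 8 positions and `tile=4`, the loop visits 12 tiles instead of the 16 a full sweep would touch. The fraction of skipped tiles approaches one half as the prompt grows, which is consistent with the gain being reported for a long prompt.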