onnxruntime
8af9f580 - Fix Local Attention off by 1 bug (#25927)

### Description

Previously, the local window size of the GQA op excluded the current token. This does not match standard HuggingFace implementations, where tokens are appended first and local masking is applied afterward; the mismatch can shift the mask by one position during generation, leading to accuracy issues. This PR corrects the mismatch by counting the current token toward the window. In practice, this effectively decreases the GQA window size by 1.

### Motivation and Context

This helps align our models with HuggingFace models.

---------

Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
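The two window conventions can be illustrated with a small sketch. This is a hypothetical NumPy illustration, not the actual onnxruntime GQA kernel: under the HuggingFace convention, a window of size `window` counts the current token, so query position `i` may attend key positions `j` with `i - window < j <= i`.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # HuggingFace convention: the window includes the current token,
    # so each query sees at most `window` keys (i - window < j <= i).
    q = np.arange(seq_len)[:, None]  # query positions, shape (seq_len, 1)
    k = np.arange(seq_len)[None, :]  # key positions, shape (1, seq_len)
    return (k <= q) & (k > q - window)

def off_by_one_mask(seq_len: int, window: int) -> np.ndarray:
    # Pre-fix behavior described above: the window excluded the current
    # token, so each query could see up to window + 1 keys
    # (i - window <= j <= i), shifting the mask by one during generation.
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (k >= q - window)
```

For example, with `seq_len=5` and `window=3`, the corrected mask lets the last query attend keys 2, 3, 4 (three tokens), while the off-by-one variant lets it attend keys 1 through 4 (four tokens), matching the "effectively decreases GQA window size by 1" note above.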