onnxruntime
8af9f580 - Fix Local Attention off by 1 bug (#25927)

### Description

Previously, the local window size of the GQA op excluded the current token. This does not match standard HuggingFace implementations, where tokens are appended first and local masking is applied afterward; the mismatch can shift the mask by one position during generation, leading to accuracy issues. This PR corrects the mismatch by counting the current token toward the window. In practice, this effectively decreases the GQA window size by 1.

### Motivation and Context

This helps align our models with HuggingFace models.

---------

Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
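The two window conventions can be illustrated with a small sketch. This is a hypothetical NumPy illustration, not the actual onnxruntime GQA kernel: under the HuggingFace convention, a window of size `window` counts the current token, so query position `i` may attend key positions `j` with `i - window < j <= i`.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # HuggingFace convention: the window includes the current token,
    # so each query sees at most `window` keys (i - window < j <= i).
    q = np.arange(seq_len)[:, None]  # query positions, shape (seq_len, 1)
    k = np.arange(seq_len)[None, :]  # key positions, shape (1, seq_len)
    return (k <= q) & (k > q - window)

def off_by_one_mask(seq_len: int, window: int) -> np.ndarray:
    # Pre-fix behavior described above: the window excluded the current
    # token, so each query could see up to window + 1 keys
    # (i - window <= j <= i), shifting the mask by one during generation.
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (k >= q - window)
```

For example, with `seq_len=5` and `window=3`, the corrected mask lets the last query attend keys 2, 3, 4 (three tokens), while the off-by-one variant lets it attend keys 1 through 4 (four tokens), matching the "effectively decreases GQA window size by 1" note above.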