Fix GPT2 attention scaling ignored in SDPA/FlashAttention (#44397)
* Fix GPT2 attention scaling config ignored in SDPA/FlashAttention backends
GPT2Attention.forward() did not pass the `scaling` parameter to
`attention_interface`, causing the `scale_attn_weights` and
`scale_attn_by_inverse_layer_idx` config options to be silently
ignored when running under the SDPA or FlashAttention backends.
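To illustrate the symptom (a standalone sketch, not the library code): with `scale_attn_weights` enabled, attention scores should be divided by sqrt(head_dim) before the softmax, so a backend that never receives the scale effectively runs with a factor of 1.0 and produces a noticeably more peaked distribution. The numbers below are made up for illustration.

```python
import math

# Illustrative sketch of the symptom (not the transformers source):
# softmax over one query row's raw q.k scores, with and without scaling.
def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

scores = [8.0, 4.0, 2.0]   # hypothetical raw dot products for one query position
head_dim = 64

scaled = softmax([s / math.sqrt(head_dim) for s in scores])  # scale applied
unscaled = softmax(scores)                                   # scale silently dropped

# The unscaled path concentrates nearly all mass on the top score,
# so eager vs SDPA/FlashAttention outputs diverge.
```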
Compute the combined scaling factor in `__init__` (following the pattern
used by LLaMA and other models) and forward it to the attention
interface so all backends produce consistent results.
Fixes #44380
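Folding both config options into one multiplicative factor can be sketched as follows (a hedged sketch; the option names come from `GPT2Config`, but this helper is not the exact diff):

```python
# Sketch: combine GPT2's two config-driven scalings into one factor,
# computed once at construction time and forwarded to the attention backend.
def combined_scaling(head_dim, layer_idx, scale_attn_weights, scale_attn_by_inverse_layer_idx):
    scaling = 1.0
    if scale_attn_weights:
        scaling /= head_dim ** 0.5       # the usual 1/sqrt(d_head)
    if scale_attn_by_inverse_layer_idx:
        scaling /= float(layer_idx + 1)  # GPT2's per-layer inverse scaling
    return scaling
```

Because both options are multiplicative, a single `self.scaling` value computed this way can be handed to every backend, which is what makes eager, SDPA, and FlashAttention agree.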
* Sync scaling fix to DecisionTransformerGPT2Attention (Copied from GPT2Attention)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Address review: use self.scaling in _upcast_and_reordered_attn, improve tests
Per reviewer feedback:
- Replace inline scale_factor computation with self.scaling in
_upcast_and_reordered_attn for both GPT2 and DecisionTransformer
- Use model.set_attn_implementation() instead of model reloading in tests
- Add FlashAttention2 vs eager comparison test
* Address review: refactor eager_attention_forward to use scaling param, fix test decorators and tolerances
- Refactor eager_attention_forward to accept scaling/dropout params (like Bert's pattern)
instead of reading module.scale_attn_weights/scale_attn_by_inverse_layer_idx directly
- Reorganize GPT2Attention.__init__ to group config attrs together with clearer comment
- Sync DecisionTransformerGPT2Attention (Copied from GPT2Attention)
- Tests: use atol=1e-4/rtol=1e-4 for SDPA, atol=1e-2/rtol=1e-2 for FA2
- Tests: add @require_torch_gpu and @pytest.mark.flash_attn_test to FA2 test
- Tests: fix FA2 comment to clarify the bug being tested
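The refactor described above can be sketched as follows (a minimal NumPy sketch under assumed shapes, not the transformers implementation; the real function also handles the attention mask, causal masking, and dropout):

```python
import numpy as np

def eager_attention_forward(query, key, value, scaling, dropout=0.0):
    # `scaling` and `dropout` are explicit parameters, so the function no
    # longer reads module.scale_attn_weights / scale_attn_by_inverse_layer_idx.
    attn_weights = (query @ key.transpose(0, 2, 1)) * scaling
    # numerically stable softmax over the key dimension
    attn_weights = attn_weights - attn_weights.max(axis=-1, keepdims=True)
    attn_weights = np.exp(attn_weights)
    attn_weights = attn_weights / attn_weights.sum(axis=-1, keepdims=True)
    # dropout omitted in this inference-only sketch
    return attn_weights @ value
```

Passing `scaling` in rather than reading module state keeps the eager path's signature aligned with the backend-agnostic attention interface, so the same kwargs flow to eager, SDPA, and FlashAttention.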
* Add issue reference to regression tests
Link both SDPA and FA2 scaling tests to the original issue #44380.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>