Fix GPT2 attention scaling ignored in SDPA/FlashAttention (#44397)
* Fix GPT2 attention scaling config ignored in SDPA/FlashAttention backends
GPT2Attention.forward() did not pass the `scaling` parameter to
`attention_interface`, causing the `scale_attn_weights` and
`scale_attn_by_inverse_layer_idx` config options to be silently
ignored when running under the SDPA or FlashAttention backends.
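To illustrate the symptom (a standalone sketch, not the library code): with `scale_attn_weights` enabled, attention scores should be divided by sqrt(head_dim) before the softmax, so a backend that never receives the scale effectively runs with a factor of 1.0 and produces a noticeably more peaked distribution. The numbers below are made up for illustration.

```python
import math

# Illustrative sketch of the symptom (not the transformers source):
# softmax over one query row's raw q.k scores, with and without scaling.
def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

scores = [8.0, 4.0, 2.0]   # hypothetical raw dot products for one query position
head_dim = 64

scaled = softmax([s / math.sqrt(head_dim) for s in scores])  # scale applied
unscaled = softmax(scores)                                   # scale silently dropped

# The unscaled path concentrates nearly all mass on the top score,
# so eager vs SDPA/FlashAttention outputs diverge.
```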
Compute the combined scaling factor in `__init__` (following the pattern
used by LLaMA and other models) and forward it to the attention
interface so all backends produce consistent results.
Fixes #44380
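Folding both config options into one multiplicative factor can be sketched as follows (a hedged sketch; the option names come from `GPT2Config`, but this helper is not the exact diff):

```python
# Sketch: combine GPT2's two config-driven scalings into one factor,
# computed once at construction time and forwarded to the attention backend.
def combined_scaling(head_dim, layer_idx, scale_attn_weights, scale_attn_by_inverse_layer_idx):
    scaling = 1.0
    if scale_attn_weights:
        scaling /= head_dim ** 0.5       # the usual 1/sqrt(d_head)
    if scale_attn_by_inverse_layer_idx:
        scaling /= float(layer_idx + 1)  # GPT2's per-layer inverse scaling
    return scaling
```

Because both options are multiplicative, a single `self.scaling` value computed this way can be handed to every backend, which is what makes eager, SDPA, and FlashAttention agree.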
* Sync scaling fix to DecisionTransformerGPT2Attention (Copied from GPT2Attention)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Address review: use self.scaling in _upcast_and_reordered_attn, improve tests
Per reviewer feedback:
- Replace inline scale_factor computation with self.scaling in
_upcast_and_reordered_attn for both GPT2 and DecisionTransformer
- Use model.set_attn_implementation() instead of model reloading in tests
- Add FlashAttention2 vs eager comparison test
* Address review: refactor eager_attention_forward to use scaling param, fix test decorators and tolerances
- Refactor eager_attention_forward to accept scaling/dropout params (like Bert's pattern)
instead of reading module.scale_attn_weights/scale_attn_by_inverse_layer_idx directly
- Reorganize GPT2Attention.__init__ to group config attrs together with clearer comment
- Sync DecisionTransformerGPT2Attention (Copied from GPT2Attention)
- Tests: use atol=1e-4/rtol=1e-4 for SDPA, atol=1e-2/rtol=1e-2 for FA2
- Tests: add @require_torch_gpu and @pytest.mark.flash_attn_test to FA2 test
- Tests: fix FA2 comment to clarify the bug being tested
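The refactor described above can be sketched as follows (a minimal NumPy sketch under assumed shapes, not the transformers implementation; the real function also handles the attention mask, causal masking, and dropout):

```python
import numpy as np

def eager_attention_forward(query, key, value, scaling, dropout=0.0):
    # `scaling` and `dropout` are explicit parameters, so the function no
    # longer reads module.scale_attn_weights / scale_attn_by_inverse_layer_idx.
    attn_weights = (query @ key.transpose(0, 2, 1)) * scaling
    # numerically stable softmax over the key dimension
    attn_weights = attn_weights - attn_weights.max(axis=-1, keepdims=True)
    attn_weights = np.exp(attn_weights)
    attn_weights = attn_weights / attn_weights.sum(axis=-1, keepdims=True)
    # dropout omitted in this inference-only sketch
    return attn_weights @ value
```

Passing `scaling` in rather than reading module state keeps the eager path's signature aligned with the backend-agnostic attention interface, so the same kwargs flow to eager, SDPA, and FlashAttention.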
* Add issue reference to regression tests
Link both SDPA and FA2 scaling tests to the original issue #44380.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>