pytorch
653dc73d - [SDPA] Wire up FlashAttention's backward (#92917)


Commit

1 year ago

[SDPA] Wire up FlashAttention's backward (#92917)

# Summary

This PR creates the `_flash_attention_backward` and `_scaled_dot_product_flash_attention_backward` native functions and registers them in the respective derivatives.yaml.

The goal is to replicate the `torch.autograd.Function` defined in the FlashAttention repo [here](https://github.com/HazyResearch/flash-attention/blob/33e0860c9c5667fded5af674882e731909096a7f/flash_attn/flash_attn_interface.py#L126) natively in PyTorch. One thing we don't have access to in native PyTorch is `ctx.save_for_backward`, so in order to save these variables I extended the objects returned from the forward functions.

### MetaFunctions

I also updated the FlashAttention meta functions to mirror the real outputs, and added a meta registration for the backward functions. I have an XLMR training script, and while eager training now works with FlashAttention, compiling this module fails with the inductor error down below.

### Questions?

Performance issues vs. mem efficient when using torch.nn.mha_forward

TorchCompile -> See proposed solution below.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917
Approved by: https://github.com/cpuhrsch
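The idea of returning extra values from the forward function instead of calling `ctx.save_for_backward` can be illustrated with a small, framework-free sketch. This is not the PR's code: the function names `attention_forward` and `attention_backward` are hypothetical, the inputs are plain Python lists of scalars, and the only assumption carried over is that the forward pass hands back an auxiliary value (here the log-sum-exp of the scores) that the backward pass then receives as an explicit argument.

```python
import math

def attention_forward(scores, values):
    """Softmax-weighted sum of `values`.

    Returns the output plus the log-sum-exp of `scores`. The extra return
    value stands in for what ctx.save_for_backward would have stored.
    """
    m = max(scores)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    out = sum(math.exp(s - lse) * v for s, v in zip(scores, values))
    return out, lse

def attention_backward(grad_out, scores, values, out, lse):
    """Gradients w.r.t. scores and values, using the saved `out` and `lse`."""
    # Recompute the softmax probabilities from the saved log-sum-exp.
    probs = [math.exp(s - lse) for s in scores]
    grad_scores = [grad_out * p * (v - out) for p, v in zip(probs, values)]
    grad_values = [grad_out * p for p in probs]
    return grad_scores, grad_values

scores = [0.5, -1.0, 2.0]
values = [1.0, 3.0, -2.0]
out, lse = attention_forward(scores, values)
grad_scores, grad_values = attention_backward(1.0, scores, values, out, lse)
```

Because the backward function takes `out` and `lse` as ordinary arguments, it can be declared as a standalone function and referenced from a derivative formula, at the cost of a wider forward return signature.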

Author

drisspg

Committer

pytorchmergebot

Parents

b6367c8a
