SemanticDiff

pytorch
0fb1c05c - [pytorch] Add decomp rule for scaled_dot_product_attention (#108180)


Commit

1 year ago

[pytorch] Add decomp rule for scaled_dot_product_attention (#108180)

`scaled_dot_product_attention` used to be decomposed in pre-autograd, because it called `_scaled_dot_product_attention_math`, which only has a `CompositeImplicitAutograd` kernel. As a result, it was decomposed into ops with finer granularity.

However, recent PRs (#103826, #105131) added new logic to `scaled_dot_product_attention`, and it now calls `_scaled_dot_product_flash_attention`, which has a CPU kernel. As a result, `_scaled_dot_product_flash_attention` shows up in `torch.export()`.

This PR adds a decomposition that ensures `scaled_dot_product_attention` is still decomposed the same way as before, i.e., through `_scaled_dot_product_attention_math`. Note that this decomp rule should be excluded by inductor.

Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad
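For readers unfamiliar with what "ops with finer granularity" means here, the following is a minimal sketch of the arithmetic that a math-style attention path breaks down into: a matmul, a scale, a row-wise softmax, and a second matmul. It is plain Python on nested lists, not the PyTorch kernel or the decomposition added by this commit. The names `matmul`, `transpose`, `softmax`, and `sdpa_math` are hypothetical helpers for illustration, and the sketch assumes no attention mask, no dropout, and no causal masking.

```python
import math


def matmul(a, b):
    """Multiply two 2-D lists: (n x m) @ (m x p) -> (n x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a 2-D list."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over a single row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sdpa_math(q, k, v):
    """Illustrative attention: softmax(q @ k^T / sqrt(d)) @ v.

    Assumption: no mask, dropout, or causal flag; this only mirrors the
    textbook formula as a chain of small, separate operations.
    """
    scale = 1.0 / math.sqrt(len(q[0]))                      # 1 / sqrt(head dim)
    scores = matmul(q, transpose(k))                        # op 1: matmul
    scaled = [[s * scale for s in row] for row in scores]   # op 2: scale
    weights = [softmax(row) for row in scaled]              # op 3: softmax
    return matmul(weights, v)                               # op 4: matmul


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
out = sdpa_math(q, k, v)
```

Each numbered step in `sdpa_math` corresponds to a separate small operation. A fused kernel would instead appear as a single opaque operation covering all four steps, which is the difference the commit message is concerned with.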

Author

larryliu0820

Committer

pytorchmergebot

