[pytorch] Add decomp rule for scaled_dot_product_attention (#108180)
`scaled_dot_product_attention` used to be decomposed in pre-autograd, given that it calls `_scaled_dot_product_attention_math` and `_scaled_dot_product_attention_math` only has a `CompositeImplicitAutograd` kernel. As a result it's decomposed into ops with finer granularity.
However recent PRs (#103826 #105131) added new logic in `scaled_dot_product_attention` and now it calls `_scaled_dot_product_flash_attention` which contains a CPU kernel. This results in `_scaled_dot_product_flash_attention` showing up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still being decomposed the same way as before, i.e., going through `_scaled_dot_product_attention_math`. Notice that this decomp rule should be excluded by inductor.
Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad