Update xfails for scaled_dot_product_attention (#120928)
Update xfails for test_dispatch_meta_outplace and test_dispatch_symbolic_meta_outplace.
These tests are now expected to fail for some configurations because we moved the registrations from meta_registrations.py to fake_impls.py. AFAIK this is okay: fake tensors will still work, since fake_impls.py has special handling for these ops. The purpose of this PR is to update the xfails so that they correctly mark the tests that actually fail.
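For illustration only (not code from this PR), here is a minimal sketch showing that SDPA still traces under fake tensor mode; note that FakeTensorMode is an internal API and the backend selected under fake mode is platform-dependent:

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

# Under FakeTensorMode, SDPA is handled by the fake impls rather than by
# meta registrations, so shape propagation still works without real compute.
with FakeTensorMode():
    q = k = v = torch.randn(2, 4, 8, 16, dtype=torch.float16)
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    print(out.shape)  # torch.Size([2, 4, 8, 16])
```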
Previously, I set these to xfail only for bfloat16, float16, and float32 (but not float64); this isn't really correct. Explanation below:
Scaled dot product attention (SDPA) has multiple implementations: efficient_attention, flash_attention, and unfused (math) attention. flash_attention supports fp16 and bf16; efficient_attention supports fp16, bf16, and fp32; unfused attention supports all dtypes.
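As a hedged aside (not part of this PR), you can probe which backend/dtype combinations a machine supports by pinning SDPA to one backend at a time via torch.nn.attention.sdpa_kernel (available in recent PyTorch; older versions expose a similar torch.backends.cuda.sdp_kernel context manager):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 4, 8, 16, device="cuda")

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH):
    for dtype in (torch.float16, torch.bfloat16, torch.float32, torch.float64):
        try:
            with sdpa_kernel(backend):  # restrict SDPA to this one backend
                F.scaled_dot_product_attention(q.to(dtype), k.to(dtype), v.to(dtype))
            status = "supported"
        except RuntimeError:
            status = "unsupported"
        print(f"{backend.name:>20} {str(dtype):<15} {status}")
```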
The efficient_attention and flash_attention implementations fail the meta tests, but unfused attention does not: it falls back to constituent ops that have registered meta kernels. A given platform may support neither, one, or both of efficient_attention and flash_attention.
So on CUDA, where all three implementations are available, bf16, fp16, and fp32 inputs select one of the fused implementations, and the test fails for those dtypes.
On ROCm, efficient_attention is not available, so fp32 uses the unfused implementation and the test passes.
Fix in this PR (see the sketch after the list):
* If any fused impl is available, then xfail float16 & bfloat16
* If efficient_attention is available, then also xfail float32
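A minimal sketch of that logic, assuming the PLATFORM_SUPPORTS_* constants from torch.testing._internal.common_cuda; the helper name sdpa_meta_xfail_dtypes is hypothetical, and the actual PR wires this into the OpInfo xfail decorators:

```python
import torch
from torch.testing._internal.common_cuda import (
    PLATFORM_SUPPORTS_FLASH_ATTENTION,
    PLATFORM_SUPPORTS_MEM_EFF_ATTENTION,
)

def sdpa_meta_xfail_dtypes():
    """Hypothetical helper: dtypes for which the meta dispatch tests should xfail."""
    dtypes = []
    # fp16/bf16 pick a fused impl whenever one exists, and the fused impls
    # lack meta registrations, so the meta tests fail for those dtypes.
    if PLATFORM_SUPPORTS_FLASH_ATTENTION or PLATFORM_SUPPORTS_MEM_EFF_ATTENTION:
        dtypes += [torch.float16, torch.bfloat16]
    # Among the fused impls, only efficient_attention covers fp32, so fp32
    # fails only where efficient_attention exists (e.g. CUDA, but not ROCm).
    if PLATFORM_SUPPORTS_MEM_EFF_ATTENTION:
        dtypes.append(torch.float32)
    return tuple(dtypes)
```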
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120928
Approved by: https://github.com/drisspg