Custom sdp implementations dense (#85984)
# Summary
- This code creates the runtime dispatch system for choosing a performant fused SDP kernel. The only choice of fused kernel is flash_attention. It also creates python flags and a context manager that can be used to turn off and on behavior for dispatch.
- This also adds support for flash_attention with dense tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85984
Approved by: https://github.com/cpuhrsch