Get flash_attn to compile for CUDA 11.6 Linux nightly build (#84941)
This PR only attempts to get the flash_attn code to compile for all architectures, so that we can dispatch to it in https://github.com/pytorch/pytorch/pull/84653.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84941
Approved by: https://github.com/drisspg, https://github.com/malfet