[dtensor] implement scaled dot product attention (flash-attention) (#120298)
As titled, this PR implements the SDPA flash-attention op in DTensor.
Flash attention is added first; efficient attention and the other
attention ops should follow a similar pattern.
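For illustration, a minimal single-device sketch of the op being distributed (plain tensors here; with this PR the same `scaled_dot_product_attention` call also works when q/k/v are DTensors sharded across a device mesh — the shapes and call below are illustrative, not taken from the PR itself):

```python
import torch
import torch.nn.functional as F

# Plain-tensor sketch of the op this PR adds DTensor support for.
# Shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)

# scaled_dot_product_attention dispatches to the flash-attention
# kernel when the inputs and backend allow it; on CPU it falls back
# to the math implementation, but the op signature is the same.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # same shape as q: (2, 8, 16, 64)
```

In the DTensor version, q/k/v would typically be sharded on the head dimension across the mesh, so each rank runs attention on its local heads with no cross-rank communication in the forward.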
Fixes https://github.com/pytorch/pytorch/issues/120333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120298
Approved by: https://github.com/XilunWu
ghstack dependencies: #120297