[sdpa] Add broadcasting for batch and num_heads dimensions to fused kernel nested preproc (#95657)
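Adds a code path to the fused-kernel nested-tensor preprocessing that broadcasts the batch and num_heads dimensions, using the strategy described in [this comment](https://github.com/pytorch/pytorch/pull/95346#issuecomment-1441283506).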
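
For readers unfamiliar with the semantics, here is a rough sketch of what broadcasting over batch and num_heads means for the query, key, and value shapes. It assumes the usual `(batch, num_heads, seq_len, head_dim)` layout and the standard rule that a size-1 dimension expands to match the others. The helper name `broadcast_batch_and_heads` is hypothetical and is not part of PyTorch; this only illustrates the shape rule, not the preprocessing strategy from the linked comment.

```python
from typing import Sequence, Tuple


def broadcast_batch_and_heads(
    q_shape: Sequence[int],
    k_shape: Sequence[int],
    v_shape: Sequence[int],
) -> Tuple[int, int]:
    """Hypothetical helper: return the common (batch, num_heads) sizes.

    Assumes each shape is laid out as (batch, num_heads, seq_len, head_dim).
    Only the first two dimensions are inspected, so a ragged seq_len
    (as in nested tensors) does not matter here.
    """

    def merge(dims: Sequence[int]) -> int:
        # A size-1 dimension broadcasts to the largest size present;
        # any other mismatch is an error.
        target = max(dims)
        for d in dims:
            if d != 1 and d != target:
                raise ValueError(f"cannot broadcast dimensions {tuple(dims)}")
        return target

    batch = merge((q_shape[0], k_shape[0], v_shape[0]))
    num_heads = merge((q_shape[1], k_shape[1], v_shape[1]))
    return batch, num_heads
```

For example, a query with batch size 1 and a key with a single head would both be expanded to the sizes carried by the value tensor, while a batch of 2 against a batch of 4 would be rejected.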
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95657
Approved by: https://github.com/drisspg