Support non-contiguous inputs for torch.distributed.nn.functional.all_gather/reduce_scatter/gather
Fixes #73515
The backward for AllGather is ReduceScatter; I am wondering whether there is a deeper reason why it is currently implemented as All2All with an explicit sum.
ReduceScatter also has a lower communication payload than All2All.
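For context, here is a minimal sketch (not the actual PyTorch source) contrasting the two backward strategies for the all_gather autograd function. `grad_outputs` stands for the tuple of incoming gradients, one per gathered output, and an initialized process group with `world_size` ranks is assumed:

```python
import torch
import torch.distributed as dist

def backward_via_all_to_all(grad_outputs, rank, world_size):
    # Current approach: each rank sends grad_outputs[j] to rank j, receives
    # world_size partial gradients for its own input, then sums explicitly.
    recv = [torch.empty_like(grad_outputs[rank]) for _ in range(world_size)]
    dist.all_to_all(recv, list(grad_outputs))
    return torch.stack(recv).sum(dim=0)

def backward_via_reduce_scatter(grad_outputs, rank, world_size):
    # Proposed approach: reduce_scatter fuses the summation into the
    # collective, so rank r directly receives sum_i grad_outputs_i[r].
    grad_input = torch.empty_like(grad_outputs[rank])
    dist.reduce_scatter(grad_input, list(grad_outputs), op=dist.ReduceOp.SUM)
    return grad_input
```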
In addition, dist.reduce_scatter already accepts a non-contiguous input_tensor_list.
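With this change, a non-contiguous view can be passed straight into the autograd-aware collective. A hypothetical usage sketch (the `demo` function is illustrative; it assumes the default process group has already been initialized):

```python
import torch
from torch.distributed.nn.functional import all_gather

def demo():
    x = torch.randn(4, 8, requires_grad=True)
    y = x.t()             # transposed view; y.is_contiguous() is False
    outs = all_gather(y)  # previously required calling y.contiguous() first
    loss = torch.stack(outs).sum()
    loss.backward()       # gradient flows back through the collective to x
```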
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75276
Approved by: https://github.com/H-Huang