Round of fixes for functional collectives (#95714)
Move collective registration to torch.__init__ to handle multipy warmup.
Fix all_reduce with non-contiguous tensors.
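
For context, a minimal sketch of what "non-contiguous" means here. It is illustrative only and is not the code changed in this PR: the helper name `to_dense_layout` is hypothetical, and copying to a dense layout before a collective is an assumed common workaround, not necessarily the fix applied.

```python
import torch

# A transposed view shares storage with its base tensor, so its elements
# are no longer laid out in dense row-major order.
base = torch.arange(6, dtype=torch.float32).reshape(2, 3)
view = base.t()

# Hypothetical helper (not from the PR): return a tensor with the same
# values in a dense layout, copying only when needed.
def to_dense_layout(t: torch.Tensor) -> torch.Tensor:
    return t if t.is_contiguous() else t.contiguous()

dense = to_dense_layout(view)
```

A tensor produced this way holds the same values as the view, which is why a collective that assumes dense memory can be given `dense` in place of `view`.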
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95714
Approved by: https://github.com/wconstab