Change the .clone() in native funcol's all_reduce to use at::MemoryFormat::Contiguous (#120042)
Summary:
I think it would make more sense for `all_reduce` to require only that its input be non-overlapping and dense, but today `ProcessGroupNCCL` requires the input to be contiguous. The `all_reduce` in non-native funcol does the same.
Also mark a test affected by this change with `@run_with_both_funcol_impls`.
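To illustrate the distinction between "contiguous" and "non-overlapping and dense" drawn above, here is a minimal pure-Python sketch over (sizes, strides) pairs. It is not code from this PR: the helper names `is_contiguous`, `is_non_overlapping_and_dense`, and `contiguous_strides` are hypothetical, and the checks are simplified approximations of the usual stride-based definitions, not the ATen implementations.

```python
from typing import Sequence, Tuple


def is_contiguous(sizes: Sequence[int], strides: Sequence[int]) -> bool:
    """Row-major contiguity check; strides of size-1 dims are ignored."""
    if 0 in sizes:
        return True  # an empty tensor is trivially contiguous
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


def is_non_overlapping_and_dense(sizes: Sequence[int], strides: Sequence[int]) -> bool:
    """True if some permutation of the dims is contiguous (no gaps, no overlap)."""
    if 0 in sizes:
        return True
    # Visit dims from smallest stride to largest, skipping size-1 dims.
    dims = sorted((stride, size) for size, stride in zip(sizes, strides) if size != 1)
    expected = 1
    for stride, size in dims:
        if stride != expected:
            return False
        expected *= size
    return True


def contiguous_strides(sizes: Sequence[int]) -> Tuple[int, ...]:
    """Strides a row-major contiguous copy of the given shape would have."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)
        running *= max(size, 1)
    return tuple(reversed(strides))


# A transposed 2x3 tensor: shape (3, 2), strides (1, 3).
transposed = ((3, 2), (1, 3))
# Every other column of a 2x4 tensor: shape (2, 2), strides (4, 2).
sliced = ((2, 2), (4, 2))

print(is_contiguous(*transposed), is_non_overlapping_and_dense(*transposed))
print(is_contiguous(*sliced), is_non_overlapping_and_dense(*sliced))
print(contiguous_strides(transposed[0]))
```

The transposed case is the interesting one: it is non-overlapping and dense but not contiguous, so a clone that keeps the input's strides would still fail a contiguity requirement, while a clone into contiguous format yields row-major strides.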
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042
Approved by: https://github.com/wanchaol