Enable pg_nccl to perform vector AllGather for uneven output splits (#83713)
Pushing PR on behalf of @aashaka
To replace: https://github.com/pytorch/pytorch/pull/82835
Summary:
A vector all_gather requires each process to gather other process' inputs into an output tensor according to the ouput list provided. Internally, pg_nccl.allgather will coalesce a list of pg_nccl._broadcast_oop to implement a vector all-gather in the case when the any shape is different in the output list. Otherwise, it will perform a ncclAllGather as usual.
- This change adds an out-of-place `_broadcast_oop` function to ProcessGroupNCCL. It allows broadcasting an input tensor and placing the output in a separate output tensor. Since allgather provides an out-of-place API, an allgather_v semantic implemented inside `pg_nccl.allgather` also needs to support out-of-place, for which an out-of-place broadcast is required to be added.
Test Plan: Added a new test `test_all_gather_v_cuda` for all_gather_v to `distributed_nccl_spawn`.
Differential Revision: D37735263
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83713
Approved by: https://github.com/mingzhe09088