Enable pg_nccl.reduce_scatter to perform vector ReduceScatter for uneven input splits (#82924)
Summary: In a vector reduce_scatter, each process reduces and scatters its input according to the input list provided, whose entries may differ in shape. Internally, pg_nccl.reduce_scatter coalesces a list of pg_nccl._reduce_oop calls to implement the vector reduce-scatter when any shape in the input list differs from the others. Otherwise, it performs an ncclReduceScatter as usual.
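For illustration only, the semantics described in the summary can be sketched as a pure-Python simulation with no NCCL or GPUs involved. All names below (`reduce_oop`, `reduce_scatter_v`, `input_lists`) are hypothetical stand-ins, not the ProcessGroupNCCL implementation; the sketch assumes a sum reduction and models each tensor as a flat list.

```python
# Pure-Python simulation of vector reduce-scatter semantics.
# Hypothetical names; this is NOT the ProcessGroupNCCL code.

def reduce_oop(contributions):
    """Out-of-place sum-reduce: builds a new list and leaves the inputs untouched."""
    return [sum(values) for values in zip(*contributions)]


def reduce_scatter_v(input_lists):
    """input_lists[rank][i] is the chunk that `rank` contributes toward rank i.

    Returns (outputs, uneven), where outputs[i] is what rank i receives and
    `uneven` reports whether the chunk shapes differ within the input list.
    """
    world_size = len(input_lists)
    # Shapes differ if the chunks destined for different ranks have different lengths.
    uneven = len({len(chunk) for chunk in input_lists[0]}) > 1
    outputs = []
    # One reduce per destination rank; in this sketch each rank acts as the
    # root of the reduce that produces its own output.
    for root in range(world_size):
        contributions = [input_lists[rank][root] for rank in range(world_size)]
        outputs.append(reduce_oop(contributions))
    return outputs, uneven


inputs = [
    [[1, 2, 3], [10]],  # rank 0: 3 elements for rank 0, 1 element for rank 1
    [[4, 5, 6], [20]],  # rank 1: same uneven split
]
outputs, uneven = reduce_scatter_v(inputs)
```

In this sketch the result is computed the same way whether or not the split is even; the `uneven` flag only mirrors the condition under which the summary says the coalesced path is taken instead of ncclReduceScatter.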
- This change adds a `CoalescedWorkNCCL` class which encapsulates the WorkNCCL requests from coalesced operations. A `.wait()` on a CoalescedWorkNCCL request will call a wait on each of the WorkNCCL requests that are coalesced.
- This change adds an out-of-place `_reduce_oop` function to ProcessGroupNCCL. It reduces an input tensor and places the result in a separate output tensor. Since reduce_scatter provides an out-of-place API, the reduce_scatter_v semantics implemented inside `pg_nccl.reduce_scatter` must also be out-of-place, which requires an out-of-place reduce.
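
As a rough sketch of the wait fan-out described for `CoalescedWorkNCCL`, the following pure-Python model shows one handle that waits on every request it wraps. `FakeWork` and `CoalescedWork` are hypothetical classes written for this example; the actual class is implemented in C++ and its interface may differ.

```python
# Minimal model of a coalesced work handle.
# Hypothetical classes for illustration; not the real WorkNCCL API.

class FakeWork:
    """Stand-in for a single asynchronous request."""

    def __init__(self, name):
        self.name = name
        self.completed = False

    def wait(self):
        # A real request would block until the operation finishes.
        self.completed = True
        return True


class CoalescedWork:
    """Wraps the requests produced by a group of coalesced operations."""

    def __init__(self, works):
        self.works = list(works)

    def wait(self):
        # Waiting on the coalesced handle waits on each wrapped request.
        for work in self.works:
            work.wait()
        return True


works = [FakeWork(f"reduce_to_rank_{rank}") for rank in range(3)]
coalesced = CoalescedWork(works)
coalesced.wait()
all_done = all(work.completed for work in works)
```

The caller holds a single handle regardless of how many reduces were coalesced, which matches how a caller of reduce_scatter expects one request back.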
Test Plan: Added a new test `test_reduce_scatter_v_cuda` for reduce_scatter_v to `distributed_nccl_spawn`.
Differential Revision: D38478781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82924
Approved by: https://github.com/kwen2501