use all_gather_into_tensor instead of all_gather (#4705)
When using all_gather, the output is a list of tensors; in torch's implementation the list is flattened before the collective and unflattened afterwards, which costs an extra GPU memory allocation and device-to-device (D2D) copies. The all-gather operations here already work on a flat GPU buffer, so replacing all_gather with all_gather_into_tensor avoids that extra allocation and the additional D2D copies.
Additionally, batching the all-gathers does not reduce peak GPU memory usage, so allgather_bucket_size has no effect.
Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: --local <zhiwei.tao@enflame-tech.com>