DeepSpeed
fd0a52c1 - use all_gather_into_tensor instead of all_gather (#4705)

Commit
2 years ago
use all_gather_into_tensor instead of all_gather (#4705)

When using all_gather, the output is a list of per-rank tensors. In torch's implementation, this list is flattened before the collective and unflattened afterward, which incurs an additional GPU memory allocation and extra device-to-device (D2D) copies. Since these all-gather call sites already have a flat GPU buffer, replacing all_gather with all_gather_into_tensor avoids the extra allocation and the D2D copies. Additionally, batching the all-gathers does not reduce peak GPU memory usage, so allgather_bucket_size has no effect.

Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: --local <zhiwei.tao@enflame-tech.com>