use all_gather_into_tensor instead of all_gather (#4705)
When using all_gather, the output is a list of tensors; in torch's implementation the list is flattened before the collective and unflattened afterwards, which costs an extra GPU memory allocation and device-to-device (D2D) copies. The all-gather operations here already work on a flat GPU buffer, so replacing all_gather with all_gather_into_tensor avoids that extra allocation and the additional D2D copies.
Additionally, batching the all-gathers does not reduce peak GPU memory usage, so allgather_bucket_size has no effect.
Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: --local <zhiwei.tao@enflame-tech.com>