[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919)
* Use allreduce_low_latency for AutoTP and implement a low-latency allreduce for the CPU backend (single node)
* Add fp32 support for SHM allreduce
* Avoid assertion failure for the FP16 data type
* Fix formatting
* Rename 'allreduce_low_latency' to 'inference_allreduce'
* Address review comments
* Rename inference_allreduce to inference_all_reduce for naming consistency
* Check whether LOCAL_SIZE is defined in ccl.cpp; also define LOCAL_SIZE in test_distributed
* Fix formatting
* Fix format error
* Update tests/unit/comm/test_dist.py
Fix world_size to 4 in the unit test
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
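For context on the change above: a single-node shared-memory (SHM) allreduce lets ranks on the same host exchange tensors through a shared buffer instead of the network stack, which is what makes the inference path low-latency. The following is a minimal single-process sketch of the reduction semantics only (names and structure are illustrative, not DeepSpeed's actual CCL/SHM implementation):

```python
import numpy as np

def shm_allreduce(rank_buffers):
    """Simulate the SHM allreduce semantics: sum the per-rank buffers
    elementwise, then write the reduced result back into every rank's
    buffer so all ranks observe the same value (a sum-allreduce)."""
    total = np.sum(rank_buffers, axis=0)  # elementwise sum across ranks
    for buf in rank_buffers:
        buf[:] = total  # in-place update, as each rank's SHM segment would be
    return rank_buffers

# 4 ranks (matching the world_size fixed in the unit test), fp32 data,
# rank r contributes a buffer filled with the value r
bufs = [np.full(8, r, dtype=np.float32) for r in range(4)]
shm_allreduce(bufs)
print(bufs[0])  # each element = 0 + 1 + 2 + 3 = 6.0
```

In the real backend each "buffer" is a segment of shared memory visible to all local ranks, so the reduction avoids per-message communication overhead entirely; fp32 and fp16 are the dtypes referenced by the commits above.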