DeepSpeed
1bc3b784 - [CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919)

Commit
[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919)

* use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node)
* add fp32 support for SHM allreduce
* avoid assertion for FP16 data type
* fix format
* change 'allreduce_low_latency' to 'inference_allreduce'
* Fix according to comments
* change inference_allreduce to inference_all_reduce to keep naming consistency
* check whether LOCAL_SIZE is defined in ccl.cpp, also define LOCAL_SIZE in test_distributed
* fix format
* Fix format error
* Update tests/unit/comm/test_dist.py: fix world_size to 4 in UT

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
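The core idea behind the single-node SHM allreduce in this commit is that ranks on the same host can exchange data through a shared-memory region instead of going through the communication library, then each rank reduces the full buffer locally. The sketch below illustrates that publish/barrier/reduce pattern using Python threads as stand-in "ranks" and a plain list as the shared region; it is a conceptual illustration only, not DeepSpeed's actual C++ `inference_all_reduce` kernel, and all names in it are hypothetical.

```python
import threading

def shm_allreduce_demo(rank_inputs):
    """Naive single-node all-reduce (sum) sketch.

    Each "rank" publishes its vector into a shared buffer, waits at a
    barrier until every rank has published, then reduces the whole
    buffer locally so all ranks end up with the same summed result.
    This mirrors the shared-memory allreduce idea, not DeepSpeed's code.
    """
    world_size = len(rank_inputs)
    shared = [None] * world_size           # stands in for the SHM region
    barrier = threading.Barrier(world_size)
    results = [None] * world_size

    def rank_fn(rank):
        shared[rank] = rank_inputs[rank]   # "copy" this rank's data into SHM
        barrier.wait()                     # ensure all ranks have published
        # Every rank performs the same local reduction (allreduce semantics).
        results[rank] = [sum(vals) for vals in zip(*shared)]

    threads = [threading.Thread(target=rank_fn, args=(r,))
               for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# With 4 ranks (matching the world_size used in the commit's unit test),
# every rank receives the element-wise sum across ranks.
out = shm_allreduce_demo([[1, 2], [3, 4], [5, 6], [7, 8]])
```

Here every entry of `out` equals `[16, 20]`, i.e. the element-wise sum over all four ranks, which is exactly the invariant an allreduce must provide regardless of transport.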