[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919)
* Use allreduce_low_latency for AutoTP and implement a low-latency allreduce for the CPU backend (single node)
* Add fp32 support for SHM allreduce
* Avoid assertion failure for the FP16 data type
* Fix formatting
* Rename 'allreduce_low_latency' to 'inference_allreduce'
* Address review comments
* Rename inference_allreduce to inference_all_reduce for naming consistency
* Check whether LOCAL_SIZE is defined in ccl.cpp; also define LOCAL_SIZE in test_distributed
* Fix formatting
* Fix format error
* Update tests/unit/comm/test_dist.py
Fix world_size to 4 in the unit test
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
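For context on the change above: a single-node shared-memory (SHM) allreduce lets ranks on the same host exchange tensors through a shared buffer instead of the network stack, which is what makes the inference path low-latency. The following is a minimal single-process sketch of the reduction semantics only (names and structure are illustrative, not DeepSpeed's actual CCL/SHM implementation):

```python
import numpy as np

def shm_allreduce(rank_buffers):
    """Simulate the SHM allreduce semantics: sum the per-rank buffers
    elementwise, then write the reduced result back into every rank's
    buffer so all ranks observe the same value (a sum-allreduce)."""
    total = np.sum(rank_buffers, axis=0)  # elementwise sum across ranks
    for buf in rank_buffers:
        buf[:] = total  # in-place update, as each rank's SHM segment would be
    return rank_buffers

# 4 ranks (matching the world_size fixed in the unit test), fp32 data,
# rank r contributes a buffer filled with the value r
bufs = [np.full(8, r, dtype=np.float32) for r in range(4)]
shm_allreduce(bufs)
print(bufs[0])  # each element = 0 + 1 + 2 + 3 = 6.0
```

In the real backend each "buffer" is a segment of shared memory visible to all local ranks, so the reduction avoids per-message communication overhead entirely; fp32 and fp16 are the dtypes referenced by the commits above.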