[CPU] Support SHM based inference_all_reduce in TorchBackend (#5391)
This PR adds a SHM-based `inference_all_reduce` kernel to the `TorchBackend`
communication backend. When running inference on a CPU server, this path replaces
the default `torch.distributed.all_reduce`, which eventually falls back to the gloo
backend. This PR improves inference performance with AutoTP when
only stock PyTorch is installed, without Intel Extension for PyTorch.
Compared with the gloo backend, the SHM-based `inference_all_reduce` kernel takes a
more direct path and performs much better on a single node.
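
A minimal sketch of the dispatch idea, assuming the SHM kernel is exposed as a custom CPU op (the `shm_comm` module name and its `all_reduce` signature below are illustrative, not the actual DeepSpeed API):

```python
# Illustrative fallback dispatch: prefer a single-node SHM all-reduce,
# otherwise use the stock torch.distributed path (gloo on CPU).
import torch
import torch.distributed as dist

def inference_all_reduce(tensor: torch.Tensor, op=dist.ReduceOp.SUM, group=None):
    try:
        # Hypothetical JIT-built CPU op; assumed to do an in-place sum
        # all-reduce over shared memory between ranks on the same node.
        from deepspeed.ops import shm_comm  # module name is an assumption
        shm_comm.all_reduce(tensor)
    except (ImportError, RuntimeError):
        # Fall back to the default path, which resolves to gloo on CPU.
        dist.all_reduce(tensor, op=op, group=group)
    return tensor
```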
| message size | gloo all_reduce(ms) | SHM all_reduce(ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |
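
Numbers like these can be reproduced with a simple per-call timing loop. The sketch below times the gloo path with stock `torch.distributed` (launched with e.g. `torchrun --nproc_per_node=<ranks>`); timing the SHM path the same way through DeepSpeed's comm backend is left as an assumption of how the new kernel is exercised.

```python
# Rough per-call latency measurement for all_reduce on CPU (gloo shown;
# swap in the SHM-backed call from DeepSpeed's comm backend to compare).
import time
import torch
import torch.distributed as dist

def time_allreduce(tensor, iters=100, warmup=10):
    for _ in range(warmup):
        dist.all_reduce(tensor)
    dist.barrier()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    dist.barrier()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")
    x = torch.ones(32 * 1024 * 1024 // 4)  # 32MB of float32
    ms = time_allreduce(x)
    if dist.get_rank() == 0:
        print(f"gloo all_reduce, 32MB: {ms:.3f} ms")
```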
In text generation with bloom-3b using AutoTP, average token latency
improved 1.45x with this PR on a 2-socket Xeon node.
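
For reference, a hedged sketch of the bloom-3b AutoTP setup used for this kind of measurement (the exact `init_inference` arguments and launch command vary between DeepSpeed versions; two tensor-parallel ranks on the 2-socket node is an assumption):

```python
# Hedged example: bloom-3b text generation with AutoTP on CPU,
# typically launched with the DeepSpeed launcher.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# AutoTP: shard the model across ranks without kernel injection, so the
# per-layer reductions go through TorchBackend's inference_all_reduce.
engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": 2},  # assumption: 2 TP ranks
                                  dtype=torch.bfloat16,
                                  replace_with_kernel_inject=False)

inputs = tokenizer("DeepSpeed is", return_tensors="pt")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```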
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>