DeepSpeed
eda5075b - [CPU] SHM based allreduce improvement for small message size (#5571)

Commit
1 year ago
[CPU] SHM based allreduce improvement for small message size (#5571)

On CPU servers, the performance of SHM-based allreduce for small messages is dominated by synchronization latency. This latency comes from two situations:

1. Waiting for status changes from other ranks.
2. Using `#pragma omp parallel for` to accelerate memory-bandwidth-bound operations such as `parallel_memcpy` or `reduce`.

Each synchronization adds a little time to the allreduce latency. In the current implementation, small messages need 5 syncs on rank 0: 1) copy-in; 2) wait for other ranks to finish copying in; 3) reduce; 4) copy-out; 5) wait for other ranks to finish copying out.

We redesigned the small-message allreduce algorithm (called `symmetric_naive_allreduce`) to need only three syncs, with every rank performing exactly the same steps: 1) copy-in; 2) wait for other ranks to finish copying in; 3) reduce directly into the output buffer. We use double buffering so we can skip the last wait and go directly to the next call using the other buffer. A carefully designed state check avoids a global barrier among ranks.

Tests show that for message sizes < 1MB, allreduce latency drops by 30% to 50%. This is especially helpful for tensor parallel decoding with small batch sizes, where the tensor size is usually a few tens of KBytes.

|message size (bytes)|new method latency (us)|old method latency (us)|
|---|---|---|
|2|13.34|20.39|
|4|13.44|19.57|
|8|13.70|19.76|
|16|13.27|20.43|
|32|13.42|19.75|
|64|13.38|19.80|
|128|13.70|19.44|
|256|13.99|20.33|
|512|13.91|20.28|
|1024|15.00|22.86|
|2048|15.82|20.93|
|4096|16.00|21.08|
|8192|16.31|21.50|
|16384|16.27|22.95|
|32768|16.13|25.17|
|65536|18.92|25.90|
|131072|21.12|27.42|
|262144|23.09|32.36|
|524288|32.78|42.80|

Because the new method computes the same reduced value on every rank, care must be taken to ensure the result is identical across all ranks.
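The three-step, double-buffered scheme described above can be sketched in plain C++, with threads standing in for ranks and global arrays standing in for the shared-memory region. This is an illustrative sketch, not the actual shm.cpp code: the sequence-number state check, buffer layout, and helper names here are assumptions made for the example.

```cpp
#include <atomic>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical sketch: 4 "ranks" (threads), 2 buffers for double buffering.
constexpr int kRanks = 4;
constexpr int kElems = 8;
constexpr int kBufs  = 2;

// Simulated SHM region: per-buffer, per-rank data slots and state counters.
float shm_data[kBufs][kRanks][kElems];
std::atomic<long> shm_state[kBufs][kRanks];  // last seq each rank published per buffer

void symmetric_naive_allreduce(int rank, long seq, const float* in, float* out) {
    const int buf = seq % kBufs;
    const int other = 1 - buf;
    // State check in place of a global barrier: once every rank has published
    // seq-1 on the other buffer, every rank has finished reading this buffer
    // at seq-2, so it is safe to overwrite it now.
    for (int r = 0; r < kRanks; ++r)
        while (shm_state[other][r].load(std::memory_order_acquire) < seq - 1) {}
    // 1) copy-in: publish this rank's data into its shared slot.
    std::memcpy(shm_data[buf][rank], in, sizeof(float) * kElems);
    shm_state[buf][rank].store(seq, std::memory_order_release);
    // 2) wait until every rank has published for this sequence number.
    for (int r = 0; r < kRanks; ++r)
        while (shm_state[buf][r].load(std::memory_order_acquire) < seq) {}
    // 3) reduce all slots directly into the output buffer. No final wait:
    //    the next call simply switches to the other buffer.
    for (int i = 0; i < kElems; ++i) {
        float acc = 0.f;
        for (int r = 0; r < kRanks; ++r) acc += shm_data[buf][r][i];
        out[i] = acc;
    }
}

// Drive kRanks threads through [first_seq, last_seq] and verify that every
// rank computed the identical reduced result on every call.
bool run_allreduce_test(long first_seq, long last_seq) {
    for (int b = 0; b < kBufs; ++b)
        for (int r = 0; r < kRanks; ++r)
            shm_state[b][r].store(0, std::memory_order_relaxed);
    std::atomic<bool> ok{true};
    std::vector<std::thread> ts;
    for (int rank = 0; rank < kRanks; ++rank)
        ts.emplace_back([&, rank] {
            for (long seq = first_seq; seq <= last_seq; ++seq) {
                float in[kElems], out[kElems];
                for (int i = 0; i < kElems; ++i) in[i] = float(rank + seq + i);
                symmetric_naive_allreduce(rank, seq, in, out);
                // Expected: sum over ranks of (r + seq + i).
                for (int i = 0; i < kElems; ++i)
                    if (out[i] != float(kRanks * (seq + i) + kRanks * (kRanks - 1) / 2))
                        ok.store(false);
            }
        });
    for (auto& t : ts) t.join();
    return ok.load();
}
```

The pre-write spin is what stands in for the "carefully designed state check": publishing sequence number s-1 on the other buffer can only happen after a rank has finished its reduce at s-2, so the acquire/release pairing guarantees no rank overwrites data another rank is still reading, without any global barrier.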
We use the test at https://github.com/delock/ds_allreduce_bench/blob/main/ds_comm_bench.py#L70 to verify that the implementation is correct; https://github.com/delock/ds_allreduce_bench/blob/main/validate.sh is a test script providing broader coverage.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
  • csrc/cpu/comm/shm.cpp