PR #5320 CPU SHM based inference_all_reduce improve

move naive allreduce into separate function

delock committed 2 years ago

separate allreduce outer loop and inner loop into different functions

delock committed 2 years ago

skeleton for ring allreduce

delock committed 2 years ago

interface finetune

delock committed 2 years ago

initial ring allreduce implementation (no sync yet)

delock committed 2 years ago

ring allreduce can run (correctness not ensured)

delock committed 2 years ago

change barrier to sync

delock committed 2 years ago

change workspace to pointer array

delock committed 2 years ago

fix minor error

delock committed 2 years ago

better state handling for ring allreduce

delock committed 2 years ago

fix accuracy error

delock committed 2 years ago

fix state handling

delock committed 2 years ago

separate buffer per rank (but will hang)

delock committed 2 years ago

Merge branch 'master' into gma/ring_allreduce

delock committed 2 years ago

cleanup

delock committed 2 years ago

per rank SHM passed

delock committed 2 years ago

Using ring allreduce instead of naive allreduce

delock committed 2 years ago

finetune buffer size and max number of ranks

delock committed 2 years ago

cleanup code

delock committed 2 years ago

fix hang with >2 ranks

delock committed 2 years ago

use ring_allreduce for bufsize >1MB only

delock committed 2 years ago

fix for 3 ranks

delock committed 2 years ago

support fp32 in ring allreduce

delock committed 2 years ago

use naive allreduce for message < 1MB

delock committed 2 years ago

remove unused functions

delock committed 2 years ago

enable distributed_naive allreduce

delock committed 2 years ago

pass 3 ranks

delock committed 2 years ago

add shm.cpp

delock committed 2 years ago

split shm based collective into separate file, no dep on oneCCL

delock committed 2 years ago

remove unneeded header files

delock committed 2 years ago

add timer to check variance at C++ level

delock committed 2 years ago

add time profiling

delock committed 2 years ago

Merge branch 'master' into gma/shm_allreduce_improve

delock committed 2 years ago

Merge branch 'master' into gma/shm_allreduce_improve

loadams committed 2 years ago

Formatting

loadams committed 2 years ago

Merge branch 'master' into gma/shm_allreduce_improve

tjruwase committed 2 years ago
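
The commits above mention a naive allreduce, a ring allreduce, and a 1MB cutoff between the two, but the commit list itself shows no code. As a rough illustration only, the sketch below simulates both algorithms in a single process, with one `std::vector<float>` standing in for each rank's buffer. It is not the PR's shared-memory implementation: all names (`naive_allreduce_sim`, `ring_allreduce_sim`, `allreduce_sim`, `kRingThresholdBytes`) are hypothetical, there is no SHM or cross-process synchronization, and the exact threshold comparison is an assumption based on the commit messages.

```cpp
#include <cstddef>
#include <vector>

// Assumed cutoff, taken from the commit messages ("ring_allreduce for bufsize >1MB only").
constexpr std::size_t kRingThresholdBytes = 1u << 20;

// Naive allreduce (simulated): sum every rank's buffer, then copy the sum to all ranks.
void naive_allreduce_sim(std::vector<std::vector<float>>& bufs) {
    if (bufs.empty()) return;
    const std::size_t n = bufs[0].size();
    std::vector<float> sum(n, 0.0f);
    for (const auto& b : bufs)
        for (std::size_t i = 0; i < n; ++i) sum[i] += b[i];
    for (auto& b : bufs) b = sum;
}

// Ring allreduce (simulated): a reduce-scatter phase followed by an allgather phase.
// Each phase takes world-1 steps; in every step rank r passes one chunk to rank r+1.
void ring_allreduce_sim(std::vector<std::vector<float>>& bufs) {
    const std::size_t world = bufs.size();
    if (world < 2) return;
    const std::size_t n = bufs[0].size();
    // Chunk c covers indices [chunk_begin(c), chunk_begin(c + 1)).
    auto chunk_begin = [&](std::size_t c) { return c * n / world; };

    // Reduce-scatter: at step s, rank r sends chunk (r - s) mod world; the receiver adds it.
    for (std::size_t s = 0; s + 1 < world; ++s) {
        for (std::size_t r = 0; r < world; ++r) {
            const std::size_t c = (r + world - s) % world;
            const std::size_t dst = (r + 1) % world;
            for (std::size_t i = chunk_begin(c); i < chunk_begin(c + 1); ++i)
                bufs[dst][i] += bufs[r][i];
        }
    }
    // Now rank r holds the fully reduced chunk (r + 1) mod world.

    // Allgather: at step s, rank r sends chunk (r + 1 - s) mod world; the receiver overwrites.
    for (std::size_t s = 0; s + 1 < world; ++s) {
        for (std::size_t r = 0; r < world; ++r) {
            const std::size_t c = (r + 1 + world - s) % world;
            const std::size_t dst = (r + 1) % world;
            for (std::size_t i = chunk_begin(c); i < chunk_begin(c + 1); ++i)
                bufs[dst][i] = bufs[r][i];
        }
    }
}

// Size-based dispatch: ring for large buffers, naive otherwise (assumed comparison).
void allreduce_sim(std::vector<std::vector<float>>& bufs) {
    if (bufs.empty()) return;
    const std::size_t bytes = bufs[0].size() * sizeof(float);
    if (bytes > kRingThresholdBytes)
        ring_allreduce_sim(bufs);
    else
        naive_allreduce_sim(bufs);
}
```

In this sketch, calling `allreduce_sim(bufs)` with one equally sized vector per rank leaves every vector holding the element-wise sum. Within a single step no rank reads a chunk that another rank writes in that same step, which is why the sequential loop over ranks gives the same result as a concurrent ring would.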

DeepSpeed CPU SHM based inference_all_reduce improve #5320 Merged