DeepSpeed
CPU SHM based inference_all_reduce improve
#5320
Merged

Commits
  • move naive all reduce into separate function
    delock committed 2 years ago
  • separate allreduce outer loop and inner loop into different functions
    delock committed 2 years ago
  • skeleton for ring allreduce
    delock committed 2 years ago
  • interface finetune
    delock committed 2 years ago
  • initial ring allreduce implementation (no sync yet)
    delock committed 2 years ago
  • ring allreduce can run (correctness not ensured)
    delock committed 2 years ago
  • change barrier to sync
    delock committed 2 years ago
  • change workspace to pointer array
    delock committed 2 years ago
  • fix minor error
    delock committed 2 years ago
  • better state handling for ring allreduce
    delock committed 2 years ago
  • fix accuracy error
    delock committed 2 years ago
  • fix state handling
    delock committed 2 years ago
  • separate buffer per rank (but will hang)
    delock committed 2 years ago
  • Merge branch 'master' into gma/ring_allreduce
    delock committed 2 years ago
  • cleanup
    delock committed 2 years ago
  • per rank SHM passed
    delock committed 2 years ago
  • Using ring allreduce instead of naive allreduce
    delock committed 2 years ago
  • finetune buffer size and max number of ranks
    delock committed 2 years ago
  • cleanup code
    delock committed 2 years ago
  • fix hang with >2 ranks
    delock committed 2 years ago
  • use ring_allreduce for bufsize >1MB only
    delock committed 2 years ago
  • fix for 3 ranks
    delock committed 2 years ago
  • support fp32 in ring allreduce
    delock committed 2 years ago
  • use naive allreduce for message < 1MB
    delock committed 2 years ago
  • remove unused functions
    delock committed 2 years ago
  • enable distributed_naive allreduce
    delock committed 2 years ago
  • pass 3 ranks
    delock committed 2 years ago
  • add shm.cpp
    delock committed 2 years ago
  • split shm based collective into separate file, no dep on oneCCL
    delock committed 2 years ago
  • remove unneeded header files
    delock committed 2 years ago
  • add timer to check variance at C++ level
    delock committed 2 years ago
  • add time profiling
    delock committed 2 years ago
  • Merge branch 'master' into gma/shm_allreduce_improve
    delock committed 2 years ago
  • Merge branch 'master' into gma/shm_allreduce_improve
    loadams committed 2 years ago
  • Formatting
    loadams committed 2 years ago
  • Merge branch 'master' into gma/shm_allreduce_improve
    tjruwase committed 2 years ago