DeepSpeed
NCCL based 1-bit Implementation + Refactor to add communication backends
#593
Merged

awan-10 merged 29 commits into staging-1bit-nccl-v1 from amawa/1-bit-refactor
awan-10 add nccl 1-bit optim.
47788326
awan-10 temporary commit to save stuff.
567232be
awan-10 Use dist collectives instead of mpi routines.
79f64049
awan-10 Merge branch 'master' into amawa/1bit-adam-nccl
39b5949d
awan-10 remove old code for comm.
57ab220a
awan-10 Fix bugs. still does not work.
ebec1fee
awan-10 modify to test the nccl side code path
3e6974d1
awan-10 Initial gather impl. Works intra-node.
a72049b6
awan-10 Updates to comm. phase 2. nccl comm. passed the tests.
1bf1c275
awan-10 refactor code to introduce nccl/mpi as backends for onebit adam.
886ebb52
awan-10 Refactor updates to test/engine.
a38351ec
awan-10 Merge branch 'master' into amawa/1-bit-refactor
716ac132
awan-10 Fix compile/runtime errors.
be75d885
awan-10 simplify support for nccl/mpi backends.
7b7f122b
awan-10 Add missing file
fd2c366f
awan-10 Add compression backend in constructor. Revert later.
df8c40d3
awan-10 modify test with some perf counting.
f29ea3f3
awan-10 Implement a true non-blocking gather for nccl side.
170ef020
awan-10 Revert "Add compression backend in constructor. Revert later."
e2ddf489
awan-10 improve the 1-bit adam test.
dbd3cff5
awan-10 Refactor comm. and compression backend in 1-bit adam.
7edc3ab2
awan-10 Fix the test.
0813d117
awan-10 Fix runtime errors and typos in nccl backend
4c3c7772
awan-10 fix mpi backend. modify tests.
d495c7a2
awan-10 modify nccl perf test.
60f3344b
awan-10 fix mpi side errors.
c1ab39e0
awan-10 Add an mpi perf test
70938e17
awan-10 Merge branch 'master' into amawa/1-bit-refactor
de634979
awan-10 requested a review from arashashari 5 years ago
awan-10 requested a review from cli99 5 years ago
awan-10 requested a review from conglongli 5 years ago
awan-10 requested a review from eltonzheng 5 years ago
awan-10 requested a review from jeffra 5 years ago
awan-10 requested a review from minjiaz 5 years ago
awan-10 requested a review from niumanar 5 years ago
awan-10 requested a review from RezaYazdaniAminabadi 5 years ago
awan-10 requested a review from samyam 5 years ago
awan-10 requested a review from ShadenSmith 5 years ago
awan-10 requested a review from tjruwase 5 years ago
awan-10 Sync DSE.
7aac0188
jeffra approved these changes on 2020-12-10
awan-10 merged 3e85a17b into staging-1bit-nccl-v1 5 years ago
mrwyattii deleted the amawa/1-bit-refactor branch 2 years ago