Megatron-DeepSpeed
Sync 4 layer norms - bf16, fp32, optimizer states on restart
#274
Open

tjruwase wants to merge 40 commits into main from olruwase/sync_layer_norms
thomasw21 WIP
8d7a6038
thomasw21 Wip
240f673e
thomasw21 Woops
1cdcd7de
thomasw21 WIP
29372806
thomasw21 Woops
7fcff06b
thomasw21 Woops
1f2f8007
thomasw21 Woops
f152e487
thomasw21 Test with alibi
ce02dd16
thomasw21 Still trying to reproduce
02365d14
thomasw21 Huh
42d6b4e3
thomasw21 Have high LR to see weights actually change
c20c8ba4
thomasw21 Launch bf16
7f2441ed
thomasw21 Woops
a4172bf9
thomasw21 Make test to work with both bf16 and fp16 to see who fails
5fbe1072
thomasw21 Woops
a0c09132
thomasw21 Remove assert
6b19339c
thomasw21 Try to figure out how the divergence happens
a5e32958
thomasw21 I think bias starts to diverge first
7145f6df
thomasw21 Woops
311e5317
thomasw21 Woops
39d4b8f9
thomasw21 Woops
8ffb278f
thomasw21 Add embed layer norm
2389bfdf
thomasw21 Woops
0cf35ee3
thomasw21 Backward compatibility on torch
f0d6d179
thomasw21 Better
07ccb3db
stas00 Merge remote-tracking branch 'origin/main' into thomas/test_different…
3c5e4914
stas00 fix
a5b5edc0
tjruwase Sync lp/hp/optim for layer norms
c7f20066
stas00 fix requirements
8f2ea60b
stas00 dynamically discovered layer norm weights / refactor
fc8f813d
stas00 fix regex
4443e6d2
stas00 add the test script
d2aa4f18
stas00 changed the title from "Olruwase/sync layer norms" to "Sync 4 layer norms - bf16, fp32, optimizer states on restart" 3 years ago
stas00 compare on cpu
d64a947e
stas00 add 2 more weights to sync
bf7eeb3a
tjruwase fp32 accessors
84825956
stas00 improve the doc, and comment out the demo
86b726cb
stas00 typo
2ac141b1
thomasw21 Sync torch_rng_state (#277)
d576775c
thomasw21 Fix device issue when using torch.broadcast
475f3730
stas00 Merge remote-tracking branch 'origin/main' into olruwase/sync_layer_n…
5b368846

Reviewers
No reviews
Assignees
No one assigned