Better
07ccb3db
Force synchronize the layer norm parameters across all TP ranks
391ed488
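For illustration only, a minimal sketch of what force-synchronizing layer norm parameters across a tensor-parallel group could look like; `model`, `tp_group`, and the name-based filter are assumptions, not this repository's actual API:

```python
import torch
import torch.distributed as dist

def force_sync_layer_norms(model: torch.nn.Module, tp_group=None) -> None:
    """Average layer-norm parameters across the tensor-parallel group so that
    every TP rank ends up with identical values (hypothetical helper)."""
    tp_world_size = dist.get_world_size(group=tp_group)
    with torch.no_grad():
        for name, param in model.named_parameters():
            # Assumed naming convention: layer-norm weights/biases contain
            # "layernorm" or "layer_norm" in their parameter names.
            if "layernorm" in name.lower() or "layer_norm" in name.lower():
                # all_reduce sums in place; divide by the group size to average.
                dist.all_reduce(param.data, op=dist.ReduceOp.SUM, group=tp_group)
                param.data.div_(tp_world_size)
```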
thomasw21 changed the base branch from `main` to `thomas/test_different_layer_norm` 3 years ago
import mpu
98d0e7cc
use the bf16 branch for testing
279a77eb
`torch.testing.assert_equal` didn't make it (#273)
87a9dba0
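For context, a hedged sketch of the API that did land in PyTorch, `torch.testing.assert_close`, which can be made to check exact equality by zeroing both tolerances (not code from this PR):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.0, 2.0, 3.0])

# rtol/atol of 0 turn the closeness check into an exact-equality check.
torch.testing.assert_close(a, b, rtol=0, atol=0)
```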
Merge remote-tracking branch 'origin/main' into thomas/fix_layer_norm
dbb59140
bf16 comms require pt-1.11
70f91f82
already part of the function
835a3e5c
reproduce the crashing on resume
37795a92
stas00 commented on 2022-03-25
run just the test we want for now
3ec65f7c
all_reduce is an in-place operation
8271d419
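The commit above points at a PyTorch behavior worth spelling out: `torch.distributed.all_reduce` writes its result back into the tensor it is given instead of returning a new one, so the rank-local value has to be cloned first if it is still needed. A standalone sketch (not code from this PR):

```python
import torch
import torch.distributed as dist

def summed_copy(t: torch.Tensor) -> torch.Tensor:
    # all_reduce mutates its argument in place; reduce a clone so the
    # original rank-local tensor `t` is left untouched.
    reduced = t.clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    return reduced
```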
Make a test that TP reshaping works
b418b47a
Woops
4b7207b5
Woops
3bc58243
Woops
05c99db6
Woops
55e10c63
Woops
2ab8a3ac
Woops
d357839d
Woops
5fb231c1
Woops
cc7ff45b
Woops
7cdb1be8
Fix load issue
4574ec97
Woops
04e89d14
Fix checkpoint path
e9431002
Test that force sync will allow TP changes
09cead38
Nit
77abee61
Now that we have a force sync mechanism, let's try to reproduce
64a62c80
Compare model_states_rank
0b7afcc9
test
ce017338
Row/column bias should be synchronized as well
89ab0b72
New list of matching embeddings
42997b2a
Figure out why state differs
e0ef1683
Test for final weight
1fc4fe82
Test that torch_rng_state
7ebbed16
Fix non matching torch_rng_state for tp_rank=0
2c49216a
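One way such a cross-rank RNG-state check could be written, sketched here with `all_gather_object`; the helper name and default group are assumptions rather than this repository's test code:

```python
import torch
import torch.distributed as dist

def rng_states_match(group=None) -> bool:
    """Return True if every rank's CPU torch RNG state equals rank 0's."""
    state = torch.get_rng_state()
    gathered = [None] * dist.get_world_size(group=group)
    dist.all_gather_object(gathered, state, group=group)
    return all(torch.equal(gathered[0], other) for other in gathered)
```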
Update test
007ecb4b
I'm surprised one can apply an in-place operation here
c3844b5c
Test out the loss from the fp32 weights and optimizer states
189f0547