Better
07ccb3db
Force synchronize the layer norm parameters across all TP ranks
391ed488
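For illustration only, a minimal sketch of what force-synchronizing layer norm parameters across a tensor-parallel group could look like; `model`, `tp_group`, and the name-based filter are assumptions, not this repository's actual API:

```python
import torch
import torch.distributed as dist

def force_sync_layer_norms(model: torch.nn.Module, tp_group=None) -> None:
    """Average layer-norm parameters across the tensor-parallel group so that
    every TP rank ends up with identical values (hypothetical helper)."""
    tp_world_size = dist.get_world_size(group=tp_group)
    with torch.no_grad():
        for name, param in model.named_parameters():
            # Assumed naming convention: layer-norm weights/biases contain
            # "layernorm" or "layer_norm" in their parameter names.
            if "layernorm" in name.lower() or "layer_norm" in name.lower():
                # all_reduce sums in place; divide by the group size to average.
                dist.all_reduce(param.data, op=dist.ReduceOp.SUM, group=tp_group)
                param.data.div_(tp_world_size)
```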
thomasw21 changed the base branch from `main` to `thomas/test_different_layer_norm` 3 years ago
import mpu
98d0e7cc
use the bf16 branch for testing
279a77eb
`torch.testing.assert_equal` didn't make it (#273)
87a9dba0
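For context, a hedged sketch of the API that did land in PyTorch, `torch.testing.assert_close`, which can be made to check exact equality by zeroing both tolerances (not code from this PR):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.0, 2.0, 3.0])

# rtol/atol of 0 turn the closeness check into an exact-equality check.
torch.testing.assert_close(a, b, rtol=0, atol=0)
```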
Merge remote-tracking branch 'origin/main' into thomas/fix_layer_norm
dbb59140
bf16 comms require pt-1.11
70f91f82
already part of the function
835a3e5c
reproduce the crashing on resume
37795a92
stas00 commented on 2022-03-25
run just the test we want for now
3ec65f7c
all_reduce is an in-place operation
8271d419
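The commit above points at a PyTorch behavior worth spelling out: `torch.distributed.all_reduce` writes its result back into the tensor it is given instead of returning a new one, so the rank-local value has to be cloned first if it is still needed. A standalone sketch (not code from this PR):

```python
import torch
import torch.distributed as dist

def summed_copy(t: torch.Tensor) -> torch.Tensor:
    # all_reduce mutates its argument in place; reduce a clone so the
    # original rank-local tensor `t` is left untouched.
    reduced = t.clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    return reduced
```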
Make a test that TP reshaping works
b418b47a
Woops
4b7207b5
Woops
3bc58243
Woops
05c99db6
Woops
55e10c63
Woops
2ab8a3ac
Woops
d357839d
Woops
5fb231c1
Woops
cc7ff45b
Woops
7cdb1be8
Fix load issue
4574ec97
Woops
04e89d14
Fix checkpoint path
e9431002
Test that force sync will allow TP changes
09cead38
Nit
77abee61
Now that we have a force sync mechanism, let's try to reproduce
64a62c80
Compare model_states_rank
0b7afcc9
test
ce017338
Row/column bias should be synchronized as well
89ab0b72
New list of matching embeddings
42997b2a
Figure out why state differs
e0ef1683
Test for final weight
1fc4fe82
Test that torch_rng_state
7ebbed16
Fix non matching torch_rng_state for tp_rank=0
2c49216a
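One way such a cross-rank RNG-state check could be written, sketched here with `all_gather_object`; the helper name and default group are assumptions rather than this repository's test code:

```python
import torch
import torch.distributed as dist

def rng_states_match(group=None) -> bool:
    """Return True if every rank's CPU torch RNG state equals rank 0's."""
    state = torch.get_rng_state()
    gathered = [None] * dist.get_world_size(group=group)
    dist.all_gather_object(gathered, state, group=group)
    return all(torch.equal(gathered[0], other) for other in gathered)
```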
Update test
007ecb4b
I'm surprised one can apply an in-place operation here
c3844b5c
Test out the loss from the fp32 weights and optimizer states
189f0547