Megatron-DeepSpeed
[MLM] Train script for non causal decoder #300 (Open)

thomasw21 wants to merge 298 commits into main from thomas/mlm_train_script
made into input and output tokens
6ad61b6d
added eos
9131fdd7
added eos
cb76cd31
test text_token
531ee688
test text_token
a7d11583
test text_token
0008cfb0
test text_token
f1461a83
test text_token
ada0f100
assigned array
298c9b71
assigned array
d2bdff6e
assigned array
4ec8db32
hardcoded sequence length
10a2b6d1
check again
a373a700
lintangsutawika show sentinal tokens
bdef71b0
lintangsutawika show sentinal tokens
262fd6ce
lintangsutawika show sentinal tokens
68a6a936
lintangsutawika show sentinal tokens
1c00d4bb
lintangsutawika add more special tokens
8b85f113
lintangsutawika changed how mlm data is loaded
85d204af
lintangsutawika changed how mlm data is loaded
4c842748
lintangsutawika changed how mlm data is loaded
084245e5
lintangsutawika changed how mlm data is loaded
32af10e8
lintangsutawika changed how mlm data is loaded
b6e0e636
lintangsutawika added new script
2af2e4b7
lintangsutawika added new script
cc5968e5
lintangsutawika added new script
cf0b2a0f
lintangsutawika try t5 dataset
fc150a05
lintangsutawika try t5 dataset
039f90f2
lintangsutawika try t5 dataset
7364781e
lintangsutawika try t5 dataset
5b1100a4
lintangsutawika try t5 dataset
45102a93
lintangsutawika try t5 dataset
7b2ebbf7
lintangsutawika try t5 dataset
fe8b3dc0
lintangsutawika try t5 dataset
f456725e
lintangsutawika try t5 dataset
ae73d8cf
lintangsutawika try t5 dataset
fae6a0bd
lintangsutawika try t5 dataset
81858424
lintangsutawika try t5 dataset
9deef493
lintangsutawika developing
1e78a4bd
lintangsutawika developing
9070929d
lintangsutawika developing
56c69de0
lintangsutawika developing
d1ca9143
lintangsutawika developing
13af6234
lintangsutawika developing
dbc555e1
lintangsutawika developing
12b209dd
lintangsutawika test to see output of get_ltor_masks_and_position_ids
698eff05
lintangsutawika test to see output of get_ltor_masks_and_position_ids
dae3cc6c
add new script
5c109c3c
add new script
2fc99951
add new script
ee7af99a
changed settings
b6701a85
changed settings
2283e581
tidy up
9d00a49f
changed tokenizer and position embedding
0298fde9
modifying mlm to reflect original implementation
bde07f08
minor fix
4c0ca2e1
minor fix
0c05596d
minor fix
30f69248
minor fix
84408ef0
minor fix
ad964c58
minor fix
45899e98
minor fix
0b945972
minor fix
2b54cc17
minor fix
ec616272
minor fix
4448d1d3
minor fix
ecd148c7
minor fix
a99f30f0
minor fix
62d3e3e9
minor fix
a1608531
minor fix
fe205f77
minor fix
d39bdaf9
minor fix
2530d3e0
minor fix
5e93c47a
minor fix
ad867998
minor fix
82c8d932
minor fix
ebf3561d
minor fix
811f9755
minor fix
de7dfc83
minor fix
be2af770
minor fix
5e7e18f4
minor fix
24d4f25d
minor fix
5926be1c
minor fix
0f18174c
minor fix
58ce7144
set correct seq len
05470d7c
refined sampling method
51a23f23
refined sampling method
43cb2f04
refined sampling method
901defc8
refined sampling method
3130d7d1
refined sampling method
18eb53d7
refined sampling method
652c545c
first commit, adding non causal mlm dataset
5a49db8e
fixed mlm dataset
81b918c9
fixed mlm dataset
95afc4f0
fixed mlm dataset
c4514d8e
fixed mlm dataset
5cca5af4
fixed mlm dataset
ae958788
minor changes
a03e59f3
removed mlm related scripts
fa1e072d
removed any scipts not related to dataset, revert arguments
e3ce0a76
added sampler and test
87e4055c
added testing data
0ae7661d
adapted test loader
71fb5aea
Update megatron/data/non_causal_mtf_dataset.py
be0cea2d
removed unused files
9daa3766
changed with impossible token
6b9e81a3
enable loading multiple indexed_dataset for each field
7feec27f
minor fix
f84f2935
data_prefix is set as dict
2778d8d8
removed sample_idx lines
61ac4b9d
change line from sample_idx to doc_idx
62e3fb13
replace shuffling _build_index_mappings with random.sample of the doc…
cb79f09e
minor changes
e9cf22a3
Muennighoff Cleanup artefacts
acd87cd5
Muennighoff Add packed preprocessing
019ed7c9
Muennighoff Use seq_length arg
7619f7a6
Muennighoff Add sources & docstrings
219209ac
added training process for t0
67424d6d
Update pretrain_t0.py
a7c424e6
thomasw21 Remove a bunch of code that's not needed
51d6c402
thomasw21 WIP
b4e374c4
thomasw21 Cleanup
0d2fdfd6
thomasw21 Add back all configs
126fa34c
thomasw21 Woops
83d24057
thomasw21 Fix tests
c93ed5ce
thomasw21 Rename testing files
528f5d34
thomasw21 Do in-place operations
8bed302d
thomasw21 Do in-place operations
bd2fede1
thomasw21 Woops
8593e425
thomasw21 Fix typo
a1eb558a
thomasw21 Add test that packing is done optimially via greedy algorithm
3bddafa8
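The greedy packing being tested above can be sketched as follows. This is a minimal illustration of the general first-fit-on-current-bin strategy, with hypothetical names, not the repository's actual implementation:

```python
# Hypothetical sketch of greedy sequence packing: append each sample to the
# current pack while it still fits within max_seq_length, otherwise close the
# pack and start a new one. Over-long samples are skipped, mirroring the
# "Silently skip samples that are too long" commit.

def greedy_pack(lengths, max_seq_length):
    """Group sample lengths into packs whose total length <= max_seq_length."""
    packs = []
    current, current_len = [], 0
    for length in lengths:
        if length > max_seq_length:
            continue  # sample cannot fit in any pack
        if current_len + length > max_seq_length:
            packs.append(current)
            current, current_len = [], 0
        current.append(length)
        current_len += length
    if current:
        packs.append(current)
    return packs
```

A test for optimality under this greedy rule would assert that each pack is filled as far as the input order allows before a new one is opened.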
thomasw21 Woops
45c94446
lintangsutawika added capabilities for padding and prefix lm index
6f28ae45
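A prefix-LM attention mask of the kind referenced here lets every position attend bidirectionally within the prefix, while positions after the prefix follow the usual causal rule. A minimal sketch (hypothetical helper, not the repo's code, which operates on batched tensors):

```python
def prefix_lm_mask(seq_length, prefix_length):
    """Return a seq_length x seq_length boolean mask.

    mask[i][j] is True when position i may attend to position j:
    always True inside the prefix (bidirectional), otherwise only
    when j <= i (causal, lower-triangular).
    """
    return [
        [j < prefix_length or j <= i for j in range(seq_length)]
        for i in range(seq_length)
    ]
```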
added adjustments and new dataset
8a4d99b7
added sentinal tokens
ea445b15
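Given the surrounding "try t5 dataset" commits, the sentinel tokens added here likely follow the T5-style span-corruption scheme: each masked span in the input is collapsed to a single sentinel token, and the targets list each sentinel followed by the tokens it replaced. A rough sketch with hypothetical names:

```python
def apply_span_corruption(tokens, spans, sentinels):
    """Replace each half-open (start, end) span with a sentinel token.

    Returns (inputs, targets): inputs are the original tokens with every
    span collapsed to its sentinel; targets interleave each sentinel with
    the tokens it stands for. Spans are assumed sorted and non-overlapping.
    """
    inputs, targets = [], []
    pos = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[pos:start])  # copy unmasked tokens
        inputs.append(sentinel)           # collapse the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])  # span content moves to targets
        pos = end
    inputs.extend(tokens[pos:])
    return inputs, targets
```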
made into input and output tokens
40708595
modifying mlm to reflect original implementation
85e84ecb
minor fix
39222938
added sampler and test
ee6438f1
Muennighoff Enable training
a869adf5
Muennighoff Add T0 training test
5ae15ef6
Muennighoff Remove artefacts
efa55ea8
Muennighoff Remove artefacts
f45266d1
thomasw21 WIP
8029564f
thomasw21 WIP
4faa7434
thomasw21 WIP
3a6d73d1
thomasw21 WIP
ea86bc8f
thomasw21 WIP
638fc567
thomasw21 move to cpu for comparison
66d2afe8
thomasw21 Use torch_assert_equal
3794b86a
thomasw21 WIP
346b08f9
thomasw21 Take in account pad + fix inverse
4203f6cb
thomasw21 Tensor and int can't be compared vi torch_assert_equal
bcba2b71
thomasw21 Woops
57156e1d
thomasw21 Test
45d92189
thomasw21 Woops
959fc71d
thomasw21 Remove unecessary unsqueeze
27197fce
thomasw21 Add necessary unsqueeze
b7374e1c
thomasw21 I'm stupid
4f6b7d32
thomasw21 I'm stupid
960b17cb
thomasw21 Tokenizers returns None when trying to access a non existing value
2b522d11
thomasw21 Force gpt2 to have a pad token
a8fcd386
thomasw21 Add a test that the packed_masking works in the modeling side
7181de45
thomasw21 Import error
172306b0
thomasw21 Tokenizer requires to have pad token
a4854bd2
thomasw21 Turns out that test_model.py did not use deepspeed version of models
06c29a9a
thomasw21 Use train_batch instead
aba48b3f
thomasw21 Make it work via DS
a9d423a4
thomasw21 Make it work via DS
6a95e25e
thomasw21 Make it work via DS
d6e435b1
thomasw21 Make it work via DS
ca8c04a7
thomasw21 Make it work via DS
f3231db3
thomasw21 Make it work via DS
987e6b4b
thomasw21 Make it work via DS
0b27fb67
thomasw21 Woops
1ba5d4a1
thomasw21 Make it work via DS
cbab16ca
thomasw21 Make it work via DS
4defbb2c
thomasw21 Make it work via DS
412939c0
thomasw21 Maybe
17a6cc0a
thomasw21 Make it work via DS
cb90679e
thomasw21 Woops
bd4a3f07
thomasw21 Try having very strict mask
66040354
thomasw21 Try updating the kernel
d98e39a5
thomasw21 Try updating the kernel
84950834
thomasw21 Try updating the kernel
ef5d4d4d
thomasw21 Try updating the kernel
69912b3f
thomasw21 Try updating the kernel
866fc56e
thomasw21 Try updating the kernel
8e9701b3
thomasw21 Inverse causal masking
15d95faf
thomasw21 Check that the padding are ignored
fe4f806c
thomasw21 Fix test
cc2aff57
thomasw21 Probably should be in this order:
93cde870
thomasw21 Revert "Probably should be in this order:"
f6d717b4
thomasw21 Add a test checking that ScaledMaskedSoftmax custom kernel does what …
910f93b9
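Testing a fused kernel like ScaledMaskedSoftmax generally means comparing its output against a plain reference implementation. A pure-Python reference for one row of masked softmax, where masked positions receive zero probability, might look like this (a sketch only; the actual test compares the CUDA kernel against a torch implementation):

```python
import math

def masked_softmax(scores, mask, scale=1.0):
    """Reference masked softmax over one row of attention scores.

    mask[j] True means position j is excluded: it contributes nothing to
    the normalizer and gets probability 0. Assumes at least one position
    is unmasked (otherwise the normalizer would be zero).
    """
    exps = [0.0 if m else math.exp(s * scale) for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

The later "Cuda kernel is not strictly equivalent" commit suggests such comparisons need a tolerance rather than exact equality.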
thomasw21 Head specific mask is not implemented
75f99ef7
thomasw21 Test something out
c34f1073
thomasw21 Test something out
ed6131aa
thomasw21 Test something out
3a846a0a
thomasw21 Test something out
5746641e
thomasw21 Test something out
292620c4
thomasw21 Test something out
0e1ef5dc
thomasw21 Test something out
964a275f
thomasw21 Test something out
8b31e9ca
thomasw21 Test something out
723a5b39
thomasw21 Test something out
65b4ea28
thomasw21 Maybe nothing is wrong
7eaced45
thomasw21 Woops
da9f3160
thomasw21 Use bloom instead
8b67bd98
thomasw21 Make MTF dataloader an infinite dataloader
84007bc2
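Making the MTF dataloader "infinite" usually means an iterator that reshuffles and restarts whenever the underlying dataset is exhausted, tracking the epoch count so the shuffle is deterministic and state can be reset. A hedged sketch, not the repository's implementation:

```python
import random

def infinite_loader(dataset, seed=0):
    """Yield samples forever, reshuffling with a per-epoch seed on each pass."""
    epoch = 0
    while True:
        order = list(range(len(dataset)))
        # Seeding with (seed + epoch) makes every epoch's order reproducible,
        # which matters when training must resume from a checkpoint.
        random.Random(seed + epoch).shuffle(order)
        for i in order:
            yield dataset[i]
        epoch += 1
```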
thomasw21 Work into moving packing logic into a dataset
273d420b
thomasw21 Woops
688d06e4
thomasw21 Woops
ddc6a61a
thomasw21 Woops
0e34e8d1
thomasw21 Woops
014b8b82
thomasw21 Woops
c53622a9
thomasw21 Woops
ea221a88
thomasw21 Woops
32749863
thomasw21 Woops
9a5bf96d
thomasw21 Woops
d1605898
thomasw21 Woops
c3ab5b95
thomasw21 Woops
f5410765
thomasw21 Requires to remember how may epochs
20be5b90
thomasw21 Find a way to reset states everytime
d9719b6d
thomasw21 Find a way to reset states everytime
4e0c4caf
thomasw21 Find a way to reset states everytime
48a55b9a
thomasw21 Find a way to reset states everytime
2e469e5a
thomasw21 Find a way to reset states everytime
74e03ec4
thomasw21 Fix bugs
f4a4733e
thomasw21 Cleanup
e1a37677
thomasw21 Merge remote-tracking branch 'official_repo/main' into thomas/mtf_tra…
efeb55a1
thomasw21 Woops
de88ab63
thomasw21 Woops
d7a6388a
thomasw21 Woops
1c2284f1
thomasw21 Woops
b759a92a
thomasw21 Woops
ef20e57a
thomasw21 Silently skip samples that are too long
5816adfb
thomasw21 Build the index from scratch everytime
37ad57e6
thomasw21 Prevent empty dataset
1572ddc9
thomasw21 Change the condition for empty slice
bebb481a
thomasw21 PR reviews
5c806992
thomasw21 Revert back changes linked to shutil.copytree
985cd028
thomasw21 Get test working
41e931a9
thomasw21 Woops
b321a349
thomasw21 Woops
0450bad8
thomasw21 Fix empty samples
de4934f5
thomasw21 Cuda kernel is not strictly equivalent
e3e21f55
thomasw21 Update tests/test_model.py
16c556c0
thomasw21 MTF optimize dataloading (#298)
f2df7715
thomasw21 Get pretrain on non causal mlm script
a45c9cd4
thomasw21 Test
606fdeb5
Base automatically changed from thomas/mtf_train_script to main 3 years ago

Reviewers: no reviews
Assignees: no one assigned
Labels / Milestone: none listed