Megatron-DeepSpeed
Add UL2 data sampling and pretraining #358
Open
janEbert wants to merge 122 commits into bigscience-workshop:main from janEbert:ul2
janEbert Fix `PretrainedFromHF` tokenizer with T5 training
b2fc6656
janEbert Allow passing existing causal attention masks
13becf1b
janEbert Refactor masked LM sampling style selection
7f50532d
janEbert Add more masked LM sampling styles
d8db1892
janEbert Allow Prefix-LM style masked LM
006c4e96
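Prefix-LM style masking, as added in the commit above, lets positions inside a prefix attend bidirectionally while the rest of the sequence stays causal. A minimal sketch of such a mask (not this repo's actual implementation):

```python
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Boolean attention mask where True means "may attend".

    Prefix positions attend bidirectionally within the prefix;
    all later positions attend causally.
    """
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Let prefix positions also attend forward within the prefix.
    mask[:prefix_len, :prefix_len] = True
    return mask

m = prefix_lm_mask(5, 2)
```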
janEbert force pushed from 21772249 to d61f7a9b 3 years ago
janEbert force pushed from d61f7a9b to db95ce81 3 years ago
janEbert force pushed from db95ce81 to 4d9ff775 3 years ago
janEbert Add UL2 pretraining for T5 model
f8023178
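UL2 pretraining mixes several denoising objectives, each announced by a paradigm token prepended to the input. The configs below are illustrative values loosely following the UL2 paper's R-/X-/S-denoiser mixture, not necessarily the ones used in this PR:

```python
import random

# Hypothetical denoiser configs: (prompt token, mean span length,
# corruption rate). Values are illustrative, not this PR's defaults.
DENOISERS = [
    ("[R]", 3.0, 0.15),   # R-denoising: short spans, low corruption
    ("[X]", 32.0, 0.15),  # X-denoising: long spans
    ("[X]", 3.0, 0.50),   # X-denoising: heavy corruption
    ("[S]", None, 0.25),  # S-denoising: PrefixLM-style suffix prediction
]

def sample_denoiser(rng: random.Random):
    """Pick one denoiser config for the next training sample."""
    return rng.choice(DENOISERS)

rng = random.Random(0)
prompt, mean_span, rate = sample_denoiser(rng)
```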
janEbert Refactor span merging
deed87f7
janEbert force pushed from 4d9ff775 to ab858336 3 years ago
janEbert Support UL2 for decoder-only models
728e076d
janEbert force pushed from ab858336 to 728e076d 3 years ago
janEbert Unconditionally use safe maximum sequence length
42ece6b8
janEbert Add custom exceptions
d18f84e5
janEbert Error out on too long sequences
fa5aa68b
janEbert Remove additional sequence truncation
c7d8a8ba
janEbert Prefer array-from-list creation
c7225163
Muennighoff commented on 2022-12-28
janEbert Remove redundant imports
69f6e707
janEbert Fix not inserting prefixes
f08a104b
janEbert Do not insert `extra_id` tokens for PrefixLM task
d2fd03e6
janEbert Document `max_seq_length_dec` argument
daf52cc0
janEbert Skip redundant computations
04be5905
janEbert Fix PrefixLM mean location
7bc5a877
janEbert Pad decoder-only inputs to same length
775e99d8
janEbert Fix decoder-only attention mask shape
538c30bf
janEbert Document index set selection for PrefixLM masking
ba4476c7
janEbert Fix `max_ngrams` for normal sampling style
678fbdca
janEbert force pushed from 60ceb79c to 96a287a6 3 years ago
janEbert Do not limit `max_predictions_per_seq`
00479e5d
janEbert Calculate and use amount of filtered tokens
795caef6
janEbert Document normal sampling style
689e15f9
janEbert Fix PrefixLM possible spans calculation
e44d0e49
janEbert force pushed from 96a287a6 to e44d0e49 3 years ago
janEbert Use binary search for PrefixLM first tail index
075f05fd
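The commit above swaps a linear scan for a binary search when locating the first tail index at or after a sampled PrefixLM split point. With a sorted list of candidate boundaries, Python's `bisect` gives the same answer in O(log n); the names below are hypothetical:

```python
from bisect import bisect_left

def first_tail_index(boundaries, split):
    """Index of the first candidate boundary >= split.

    `boundaries` must be sorted; binary search replaces the
    linear scan this kind of lookup would otherwise need.
    """
    return bisect_left(boundaries, split)

idx = first_tail_index([0, 4, 9, 15, 23], 10)
```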
janEbert Calculate n-gram indices lazily
6bc7471d
janEbert Fix code style
a105f320
janEbert Prefer list comprehensions
f0fe282a
janEbert Allow recognizing when UL2 is used
11bd6db5
janEbert Support UL2 tokens for all tokenizers
43eee931
janEbert Support `<extra_id>` tokens for GPT tokenizer
6686f042
janEbert Fix tokenizer vocab access
f6128c63
janEbert Revert inheriting from `T5Dataset`
8f48763f
janEbert Fix GPT tokenizer special token handling
7f99a120
janEbert Do inherit from `torch.utils.data.Dataset`
535a3069
janEbert Add whitespace
db623b35
janEbert Allow selectively disabling denoiser token
ef72280f
janEbert Allow not replacing masks with sentinel tokens
001b50cd
janEbert Support not adding mask tokens in span corruption
23c052f5
janEbert Fix expected number of added tokens
0f4fd3ff
janEbert Fix non-masked data
da1f4e90
janEbert Fix unclear wording
55320eaf
janEbert Adjust code style
5d27b27f
janEbert Fix covered index skipping
23181ab3
janEbert Prepend objective token before truncating
6032cc6c
janEbert Automatically truncate sequences for decoder-only
c9c336f7
janEbert force pushed from da07a5e4 to c9c336f7 3 years ago
janEbert Fix covered span skipping fix
b8003cba
janEbert Make `build_index_mappings` public
e3d91a6a
janEbert Refactor getting sample
e61e78fa
janEbert Add sample packing to T5 dataset
c3b0a55e
janEbert Add sample packing to UL2 dataset
c4d748ba
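Sample packing, added to the T5 and UL2 datasets above, concatenates several short documents into one training sequence to avoid wasting context on padding. A greatly simplified greedy sketch (the real dataset code also handles prompt-token repetition, padding, and denoiser mixing):

```python
def pack_samples(docs, max_len):
    """Greedily pack token sequences into bins of at most max_len tokens."""
    packs, current, used = [], [], 0
    for doc in docs:
        # Start a new pack when the next document would overflow this one.
        if used + len(doc) > max_len and current:
            packs.append(current)
            current, used = [], 0
        current.append(doc)
        used += len(doc)
    if current:
        packs.append(current)
    return packs

packs = pack_samples([[1] * 3, [2] * 4, [3] * 5, [4] * 2], max_len=8)
```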
janEbert Fix typo and comment placement
689b57e3
janEbert Fix not supplying `--pack-samples` argument
af204e75
janEbert Add support for UL2R-style implementation
78eb0358
janEbert Fix T5 dataset packing
c03eed4e
janEbert Refactor `get_sample` to return a list
9e84f06d
janEbert Fix T5 sample packing
5e2b4f54
janEbert Fix UL2 sample packing
e2a0c36d
janEbert Refactor samples dict creation
c2884c8c
janEbert Fix desired seq length
7eb79236
janEbert Fix padding removal
dd4c0d0d
janEbert Allow repeating UL2 prompt token when packing
58148f8a
janEbert Allow packing different denoisers together
c41fecd0
janEbert Refactor sample packing functions
057bb476
janEbert Repeat prompt by default when packing UL2
e2062b79
janEbert Support pipelining for decoder-only model
d31b89f7
janEbert Fix GPT tokenizer vocab size query
17dca4fe
janEbert Handle possibly empty list
bf9b1eb5
janEbert Fix no newline at EOF
c4aa4cdc
janEbert Allow full prefix Prefix-LM attention sampling
8d7a0dfb
janEbert Support PrefixLM models
9bd6e1e2
janEbert Allow setting number of few-shot examples
ba4ab491
janEbert Update task/dataset name
9f531711
janEbert Do not remove last token
5b63d0b5
janEbert Fix PrefixLM contexts
639b71d2
janEbert Fix module refactor
127d1e49
janEbert Fix possible `TypeError`
1bb788d0
janEbert Optionally add prefix tokens
cf5965a1
janEbert Automatically add UL2 tokens
a5382384
janEbert Fix context lengths batch chunking
3a8bc356
janEbert Allow different models to be loaded
6f0e33a7
janEbert Fix context batch size padding
9c4c7187
janEbert Add xPos embeddings
754cf21a
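xPos, introduced in the commit above, augments rotary embeddings with per-dimension exponential decay: queries are scaled by a factor and keys by its inverse, so attention scores decay smoothly with relative distance. A sketch of just the decay factors, following the published xPos formulation; `scale_base` and `gamma` are the paper's defaults, not necessarily this PR's values:

```python
import numpy as np

def xpos_scale(positions, dim, scale_base=512.0, gamma=0.4):
    """Per-position, per-dimension xPos decay factors (sketch).

    Queries would be multiplied by these factors and keys by their
    reciprocal, on top of the usual rotary embedding.
    """
    # Per-dimension base in (gamma/(1+gamma), 1], larger for later dims.
    base = (np.arange(0, dim, 2) + gamma * dim) / ((1 + gamma) * dim)
    # Exponent grows with position, giving distance-dependent decay.
    return base[None, :] ** (np.asarray(positions)[:, None] / scale_base)

factors = xpos_scale([0, 1, 256], dim=8)
```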
janEbert Add optional UL2 normal distribution scaling
08b0eaf7
janEbert Allow evaluating encoder-decoder models
15622d21
janEbert force pushed from 0557bb71 to d1a9dcc3 3 years ago
janEbert Fix not passing `scale_normal_std`
e5a6169d
janEbert Add T5-style GLU layers
d583fe9d
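T5-style GLU layers (the gated-GELU feed-forward of T5 v1.1) replace the single MLP input projection with two: one passed through GELU, one linear, multiplied elementwise before the output projection. A NumPy sketch of the computation, not this repo's module:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as commonly used in Megatron-style code.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def t5_glu_mlp(x, w_in, w_gate, w_out):
    """T5-v1.1-style gated MLP (GEGLU) sketch.

    Note there is no bias anywhere, matching the later commit that
    drops the bias from the second MLP layer when T5 GLU is used.
    """
    return (gelu(x @ w_in) * (x @ w_gate)) @ w_out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
y = t5_glu_mlp(x,
               rng.standard_normal((4, 8)),
               rng.standard_normal((4, 8)),
               rng.standard_normal((8, 4)))
```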
janEbert force pushed from d72c0091 to d583fe9d 3 years ago
janEbert Rename xPos embedding class
ad7de7ee
janEbert Integrate xPos embedding
81a68f79
janEbert Handle xPos embedding
46e145d5
janEbert Do not use bias for 2nd MLP layer if using T5 GLU
482f0ea9
janEbert Fix T5 GLU constructor arguments
4385f7b6
janEbert Refactor samples dict creation
2d24b13b
janEbert Move callees under caller
bd461f5f
janEbert Handle empty context
35b2956a
janEbert Handle more possible model types
f0171e01
janEbert Fix fully truncated contexts with prefix tokens
92158d86
janEbert Make T5 GLU checks safer
3b7692f9
janEbert Improve import code style
b37d3ee1
janEbert Refactor dummy barriers
5959e89e
janEbert Refactor file name creation
ce8c1a5a
janEbert Allow packing only full documents
3e529661
janEbert Use full-doc packing for T5-style datasets
23efa88b
janEbert Fix trying to all-reduce non-existent bias
88eb98ad
janEbert Fix truncating packed sequences without padding
59e84516
janEbert Speed up packed dataset indexing
24d46ff0
janEbert Try to exit padding removal early
600542da
janEbert Fix xPos embedding
58831d2b
janEbert Fix padding loss mask
fe45cea4
janEbert Handle failure mode regarding non-DS checkpoints
15e7b988
janEbert Fix decoder-only and no-mask-tokens seq lengths
ae45a9ec
janEbert Omit second objective token if without mask tokens
0c91b960
janEbert Fix NumPy deprecations
0c246c46
janEbert Fix supplied arguments
7ce86350
janEbert Do not add separator if S-denoising
7290181c
janEbert Fix caching error
628d847b
janEbert Fix number of labels calculation for decoder-only
9c727e7b
janEbert Do not automatically add <EOS> token when packing
4ffa9519
janEbert Allow silently ignoring causal attention mask
ff5787ee