Megatron-DeepSpeed
Add UL2 data sampling and pretraining #358
Open
janEbert wants to merge 122 commits into bigscience-workshop:main from janEbert:ul2
janEbert Fix `PretrainedFromHF` tokenizer with T5 training
b2fc6656
janEbert Allow passing existing causal attention masks
13becf1b
janEbert Refactor masked LM sampling style selection
7f50532d
janEbert Add more masked LM sampling styles
d8db1892
janEbert Allow Prefix-LM style masked LM
006c4e96
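Prefix-LM style masking, as added in the commit above, lets positions inside a prefix attend bidirectionally while the rest of the sequence stays causal. A minimal sketch of such a mask (not this repo's actual implementation):

```python
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Boolean attention mask where True means "may attend".

    Prefix positions attend bidirectionally within the prefix;
    all later positions attend causally.
    """
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Let prefix positions also attend forward within the prefix.
    mask[:prefix_len, :prefix_len] = True
    return mask

m = prefix_lm_mask(5, 2)
```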
janEbert force pushed from 21772249 to d61f7a9b 3 years ago
janEbert force pushed from d61f7a9b to db95ce81 3 years ago
janEbert force pushed from db95ce81 to 4d9ff775 3 years ago
janEbert Add UL2 pretraining for T5 model
f8023178
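UL2 pretraining mixes several denoising objectives, each announced by a paradigm token prepended to the input. The configs below are illustrative values loosely following the UL2 paper's R-/X-/S-denoiser mixture, not necessarily the ones used in this PR:

```python
import random

# Hypothetical denoiser configs: (prompt token, mean span length,
# corruption rate). Values are illustrative, not this PR's defaults.
DENOISERS = [
    ("[R]", 3.0, 0.15),   # R-denoising: short spans, low corruption
    ("[X]", 32.0, 0.15),  # X-denoising: long spans
    ("[X]", 3.0, 0.50),   # X-denoising: heavy corruption
    ("[S]", None, 0.25),  # S-denoising: PrefixLM-style suffix prediction
]

def sample_denoiser(rng: random.Random):
    """Pick one denoiser config for the next training sample."""
    return rng.choice(DENOISERS)

rng = random.Random(0)
prompt, mean_span, rate = sample_denoiser(rng)
```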
janEbert Refactor span merging
deed87f7
janEbert force pushed from 4d9ff775 to ab858336 3 years ago
janEbert Support UL2 for decoder-only models
728e076d
janEbert force pushed from ab858336 to 728e076d 3 years ago
janEbert Unconditionally use safe maximum sequence length
42ece6b8
janEbert Add custom exceptions
d18f84e5
janEbert Error out on too long sequences
fa5aa68b
janEbert Remove additional sequence truncation
c7d8a8ba
janEbert Prefer array-from-list creation
c7225163
Muennighoff commented on 2022-12-28
janEbert Remove redundant imports
69f6e707
janEbert Fix not inserting prefixes
f08a104b
janEbert Do not insert `extra_id` tokens for PrefixLM task
d2fd03e6
janEbert Document `max_seq_length_dec` argument
daf52cc0
janEbert Skip redundant computations
04be5905
janEbert Fix PrefixLM mean location
7bc5a877
janEbert Pad decoder-only inputs to same length
775e99d8
janEbert Fix decoder-only attention mask shape
538c30bf
janEbert Document index set selection for PrefixLM masking
ba4476c7
janEbert Fix `max_ngrams` for normal sampling style
678fbdca
janEbert force pushed from 60ceb79c to 96a287a6 3 years ago
janEbert Do not limit `max_predictions_per_seq`
00479e5d
janEbert Calculate and use amount of filtered tokens
795caef6
janEbert Document normal sampling style
689e15f9
janEbert Fix PrefixLM possible spans calculation
e44d0e49
janEbert force pushed from 96a287a6 to e44d0e49 3 years ago
janEbert Use binary search for PrefixLM first tail index
075f05fd
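The commit above swaps a linear scan for a binary search when locating the first tail index at or after a sampled PrefixLM split point. With a sorted list of candidate boundaries, Python's `bisect` gives the same answer in O(log n); the names below are hypothetical:

```python
from bisect import bisect_left

def first_tail_index(boundaries, split):
    """Index of the first candidate boundary >= split.

    `boundaries` must be sorted; binary search replaces the
    linear scan this kind of lookup would otherwise need.
    """
    return bisect_left(boundaries, split)

idx = first_tail_index([0, 4, 9, 15, 23], 10)
```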
janEbert Calculate n-gram indices lazily
6bc7471d
janEbert Fix code style
a105f320
janEbert Prefer list comprehensions
f0fe282a
janEbert Allow recognizing when UL2 is used
11bd6db5
janEbert Support UL2 tokens for all tokenizers
43eee931
janEbert Support `<extra_id>` tokens for GPT tokenizer
6686f042
janEbert Fix tokenizer vocab access
f6128c63
janEbert Revert inheriting from `T5Dataset`
8f48763f
janEbert Fix GPT tokenizer special token handling
7f99a120
janEbert Do inherit from `torch.utils.data.Dataset`
535a3069
janEbert Add whitespace
db623b35
janEbert Allow selectively disabling denoiser token
ef72280f
janEbert Allow not replacing masks with sentinel tokens
001b50cd
janEbert Support not adding mask tokens in span corruption
23c052f5
janEbert Fix expected number of added tokens
0f4fd3ff
janEbert Fix non-masked data
da1f4e90
janEbert Fix unclear wording
55320eaf
janEbert Adjust code style
5d27b27f
janEbert Fix covered index skipping
23181ab3
janEbert Prepend objective token before truncating
6032cc6c
janEbert Automatically truncate sequences for decoder-only
c9c336f7
janEbert force pushed from da07a5e4 to c9c336f7 3 years ago
janEbert Fix covered span skipping fix
b8003cba
janEbert Make `build_index_mappings` public
e3d91a6a
janEbert Refactor getting sample
e61e78fa
janEbert Add sample packing to T5 dataset
c3b0a55e
janEbert Add sample packing to UL2 dataset
c4d748ba
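Sample packing, added to the T5 and UL2 datasets above, concatenates several short documents into one training sequence to avoid wasting context on padding. A greatly simplified greedy sketch (the real dataset code also handles prompt-token repetition, padding, and denoiser mixing):

```python
def pack_samples(docs, max_len):
    """Greedily pack token sequences into bins of at most max_len tokens."""
    packs, current, used = [], [], 0
    for doc in docs:
        # Start a new pack when the next document would overflow this one.
        if used + len(doc) > max_len and current:
            packs.append(current)
            current, used = [], 0
        current.append(doc)
        used += len(doc)
    if current:
        packs.append(current)
    return packs

packs = pack_samples([[1] * 3, [2] * 4, [3] * 5, [4] * 2], max_len=8)
```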
janEbert Fix typo and comment placement
689b57e3
janEbert Fix not supplying `--pack-samples` argument
af204e75
janEbert Add support for UL2R-style implementation
78eb0358
janEbert Fix T5 dataset packing
c03eed4e
janEbert Refactor `get_sample` to return a list
9e84f06d
janEbert Fix T5 sample packing
5e2b4f54
janEbert Fix UL2 sample packing
e2a0c36d
janEbert Refactor samples dict creation
c2884c8c
janEbert Fix desired seq length
7eb79236
janEbert Fix padding removal
dd4c0d0d
janEbert Allow repeating UL2 prompt token when packing
58148f8a
janEbert Allow packing different denoisers together
c41fecd0
janEbert Refactor sample packing functions
057bb476
janEbert Repeat prompt by default when packing UL2
e2062b79
janEbert Support pipelining for decoder-only model
d31b89f7
janEbert Fix GPT tokenizer vocab size query
17dca4fe
janEbert Handle possibly empty list
bf9b1eb5
janEbert Fix no newline at EOF
c4aa4cdc
janEbert Allow full prefix Prefix-LM attention sampling
8d7a0dfb
janEbert Support PrefixLM models
9bd6e1e2
janEbert Allow setting number of few-shot examples
ba4ab491
janEbert Update task/dataset name
9f531711
janEbert Do not remove last token
5b63d0b5
janEbert Fix PrefixLM contexts
639b71d2
janEbert Fix module refactor
127d1e49
janEbert Fix possible `TypeError`
1bb788d0
janEbert Optionally add prefix tokens
cf5965a1
janEbert Automatically add UL2 tokens
a5382384
janEbert Fix context lengths batch chunking
3a8bc356
janEbert Allow different models to be loaded
6f0e33a7
janEbert Fix context batch size padding
9c4c7187
janEbert Add xPos embeddings
754cf21a
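xPos, introduced in the commit above, augments rotary embeddings with per-dimension exponential decay: queries are scaled by a factor and keys by its inverse, so attention scores decay smoothly with relative distance. A sketch of just the decay factors, following the published xPos formulation; `scale_base` and `gamma` are the paper's defaults, not necessarily this PR's values:

```python
import numpy as np

def xpos_scale(positions, dim, scale_base=512.0, gamma=0.4):
    """Per-position, per-dimension xPos decay factors (sketch).

    Queries would be multiplied by these factors and keys by their
    reciprocal, on top of the usual rotary embedding.
    """
    # Per-dimension base in (gamma/(1+gamma), 1], larger for later dims.
    base = (np.arange(0, dim, 2) + gamma * dim) / ((1 + gamma) * dim)
    # Exponent grows with position, giving distance-dependent decay.
    return base[None, :] ** (np.asarray(positions)[:, None] / scale_base)

factors = xpos_scale([0, 1, 256], dim=8)
```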
janEbert Add optional UL2 normal distribution scaling
08b0eaf7
janEbert Allow evaluating encoder-decoder models
15622d21
janEbert force pushed from 0557bb71 to d1a9dcc3 3 years ago
janEbert Fix not passing `scale_normal_std`
e5a6169d
janEbert Add T5-style GLU layers
d583fe9d
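T5-style GLU layers (the gated-GELU feed-forward of T5 v1.1) replace the single MLP input projection with two: one passed through GELU, one linear, multiplied elementwise before the output projection. A NumPy sketch of the computation, not this repo's module:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as commonly used in Megatron-style code.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def t5_glu_mlp(x, w_in, w_gate, w_out):
    """T5-v1.1-style gated MLP (GEGLU) sketch.

    Note there is no bias anywhere, matching the later commit that
    drops the bias from the second MLP layer when T5 GLU is used.
    """
    return (gelu(x @ w_in) * (x @ w_gate)) @ w_out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
y = t5_glu_mlp(x,
               rng.standard_normal((4, 8)),
               rng.standard_normal((4, 8)),
               rng.standard_normal((8, 4)))
```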
janEbert force pushed from d72c0091 to d583fe9d 3 years ago
janEbert Rename xPos embedding class
ad7de7ee
janEbert Integrate xPos embedding
81a68f79
janEbert Handle xPos embedding
46e145d5
janEbert Do not use bias for 2nd MLP layer if using T5 GLU
482f0ea9
janEbert Fix T5 GLU constructor arguments
4385f7b6
janEbert Refactor samples dict creation
2d24b13b
janEbert Move callees under caller
bd461f5f
janEbert Handle empty context
35b2956a
janEbert Handle more possible model types
f0171e01
janEbert Fix fully truncated contexts with prefix tokens
92158d86
janEbert Make T5 GLU checks safer
3b7692f9
janEbert Improve import code style
b37d3ee1
janEbert Refactor dummy barriers
5959e89e
janEbert Refactor file name creation
ce8c1a5a
janEbert Allow packing only full documents
3e529661
janEbert Use full-doc packing for T5-style datasets
23efa88b
janEbert Fix trying to all-reduce non-existent bias
88eb98ad
janEbert Fix truncating packed sequences without padding
59e84516
janEbert Speed up packed dataset indexing
24d46ff0
janEbert Try to exit padding removal early
600542da
janEbert Fix xPos embedding
58831d2b
janEbert Fix padding loss mask
fe45cea4
janEbert Handle failure mode regarding non-DS checkpoints
15e7b988
janEbert Fix decoder-only and no-mask-tokens seq lengths
ae45a9ec
janEbert Omit second objective token if without mask tokens
0c91b960
janEbert Fix NumPy deprecations
0c246c46
janEbert Fix supplied arguments
7ce86350
janEbert Do not add separator if S-denoising
7290181c
janEbert Fix caching error
628d847b
janEbert Fix number of labels calculation for decoder-only
9c727e7b
janEbert Do not automatically add <EOS> token when packing
4ffa9519
janEbert Allow silently ignoring causal attention mask
ff5787ee