Add UL2 data sampling and pretraining #358
Fix `PretrainedFromHF` tokenizer with T5 training
b2fc6656
Allow passing existing causal attention masks
13becf1b
Refactor masked LM sampling style selection
7f50532d
Add more masked LM sampling styles
d8db1892
Allow Prefix-LM style masked LM
006c4e96
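The Prefix-LM style masking added above can be illustrated with a minimal sketch (my own illustration, not the PR's actual code): the first `prefix_len` positions attend bidirectionally, while the remaining positions attend causally.

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """Boolean attention mask for PrefixLM.

    Rows are query positions, columns are key positions. A True entry
    means the query may attend to that key. Queries inside the prefix
    see the whole prefix; queries past it attend causally.
    """
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal part
    mask[:, :prefix_len] = True  # prefix keys are visible to every query
    return mask

m = prefix_lm_mask(4, 2)
# Within the prefix, position 0 can attend "forward" to position 1:
assert m[0, 1] and not m[0, 2]
```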
janEbert force pushed from 21772249 to d61f7a9b 3 years ago
janEbert force pushed from d61f7a9b to db95ce81 3 years ago
janEbert force pushed from db95ce81 to 4d9ff775 3 years ago
Add UL2 pretraining for T5 model
f8023178
Refactor span merging
deed87f7
janEbert force pushed from 4d9ff775 to ab858336 3 years ago
Support UL2 for decoder-only models
728e076d
janEbert force pushed from ab858336 to 728e076d 3 years ago
Unconditionally use safe maximum sequence length
42ece6b8
Add custom exceptions
d18f84e5
Error out on too long sequences
fa5aa68b
Remove additional sequence truncation
c7d8a8ba
Prefer array-from-list creation
c7225163
Remove redundant imports
69f6e707
Fix not inserting prefixes
f08a104b
Do not insert `extra_id` tokens for PrefixLM task
d2fd03e6
Document `max_seq_length_dec` argument
daf52cc0
Skip redundant computations
04be5905
Fix PrefixLM mean location
7bc5a877
Pad decoder-only inputs to same length
775e99d8
Fix decoder-only attention mask shape
538c30bf
Document index set selection for PrefixLM masking
ba4476c7
Fix `max_ngrams` for normal sampling style
678fbdca
janEbert force pushed from 60ceb79c to 96a287a6 3 years ago
Do not limit `max_predictions_per_seq`
00479e5d
Calculate and use amount of filtered tokens
795caef6
Document normal sampling style
689e15f9
Fix PrefixLM possible spans calculation
e44d0e49
janEbert force pushed from 96a287a6 to e44d0e49 3 years ago
Use binary search for PrefixLM first tail index
075f05fd
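The binary-search change above can be sketched as follows; assuming span start offsets are kept sorted, `bisect_left` replaces a linear scan for the first span at or past the prefix split (names here are illustrative, not the PR's actual identifiers):

```python
import bisect

def first_tail_index(span_starts, split):
    """Index of the first span starting at or after `split`.

    `span_starts` must be sorted ascending; bisect_left makes this
    O(log n) instead of the O(n) of a linear scan.
    """
    return bisect.bisect_left(span_starts, split)

starts = [0, 4, 9, 15, 22]
first_tail_index(starts, 10)  # first span entirely in the tail
```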
Calculate n-gram indices lazily
6bc7471d
Fix code style
a105f320
Prefer list comprehensions
f0fe282a
Allow recognizing when UL2 is used
11bd6db5
Support UL2 tokens for all tokenizers
43eee931
Support `<extra_id>` tokens for GPT tokenizer
6686f042
Fix tokenizer vocab access
f6128c63
Revert inheriting from `T5Dataset`
8f48763f
Fix GPT tokenizer special token handling
7f99a120
Do inherit from `torch.utils.data.Dataset`
535a3069
Add whitespace
db623b35
Allow selectively disabling denoiser token
ef72280f
Allow not replacing masks with sentinel tokens
001b50cd
Support not adding mask tokens in span corruption
23c052f5
Fix expected number of added tokens
0f4fd3ff
Fix non-masked data
da1f4e90
Fix unclear wording
55320eaf
Adjust code style
5d27b27f
Fix covered index skipping
23181ab3
Prepend objective token before truncating
6032cc6c
Automatically truncate sequences for decoder-only
c9c336f7
janEbert force pushed from da07a5e4 to c9c336f7 3 years ago
Fix covered span skipping fix
b8003cba
Make `build_index_mappings` public
e3d91a6a
Refactor getting sample
e61e78fa
Add sample packing to T5 dataset
c3b0a55e
Add sample packing to UL2 dataset
c4d748ba
Fix typo and comment placement
689b57e3
Fix not supplying `--pack-samples` argument
af204e75
Add support for UL2R-style implementation
78eb0358
Fix T5 dataset packing
c03eed4e
Refactor `get_sample` to return a list
9e84f06d
Fix T5 sample packing
5e2b4f54
Fix UL2 sample packing
e2a0c36d
Refactor samples dict creation
c2884c8c
Fix desired seq length
7eb79236
Fix padding removal
dd4c0d0d
Allow repeating UL2 prompt token when packing
58148f8a
Allow packing different denoisers together
c41fecd0
Refactor sample packing functions
057bb476
Repeat prompt by default when packing UL2
e2062b79
Support pipelining for decoder-only model
d31b89f7
Fix GPT tokenizer vocab size query
17dca4fe
Handle possibly empty list
bf9b1eb5
Fix no newline at EOF
c4aa4cdc
Allow full prefix Prefix-LM attention sampling
8d7a0dfb
Support PrefixLM models
9bd6e1e2
Allow setting number of few-shot examples
ba4ab491
Update task/dataset name
9f531711
Do not remove last token
5b63d0b5
Fix PrefixLM contexts
639b71d2
Fix module refactor
127d1e49
Fix possible `TypeError`
1bb788d0
Optionally add prefix tokens
cf5965a1
Automatically add UL2 tokens
a5382384
Fix context lengths batch chunking
3a8bc356
Allow different models to be loaded
6f0e33a7
Fix context batch size padding
9c4c7187
Add xPos embeddings
754cf21a
Add optional UL2 normal distribution scaling
08b0eaf7
Allow evaluating encoder-decoder models
15622d21
janEbert force pushed from 0557bb71 to d1a9dcc3 3 years ago
Fix not passing `scale_normal_std`
e5a6169d
Add T5-style GLU layers
d583fe9d
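A minimal NumPy sketch of a T5-v1.1-style gated MLP, for orientation only (the PR targets Megatron's PyTorch layers; weight names here are made up). The gate path multiplies a GELU branch elementwise, and, matching the later commit dropping the second MLP bias, no biases are used:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def t5_glu_mlp(x, w_in, w_gate, w_out):
    """T5-style GLU MLP: GELU(x @ w_in) * (x @ w_gate), then project down.

    Bias-free throughout, as in T5 v1.1's GEGLU feed-forward block.
    """
    return (gelu(x @ w_in) * (x @ w_gate)) @ w_out

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))        # (batch, d_model)
w_in = rng.normal(size=(8, 32))    # d_model -> d_ff
w_gate = rng.normal(size=(8, 32))
w_out = rng.normal(size=(32, 8))   # d_ff -> d_model
t5_glu_mlp(x, w_in, w_gate, w_out).shape  # (2, 8)
```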
janEbert force pushed from d72c0091 to d583fe9d 3 years ago
Rename xPos embedding class
ad7de7ee
Integrate xPos embedding
81a68f79
Handle xPos embedding
46e145d5
Do not use bias for 2nd MLP layer if using T5 GLU
482f0ea9
Fix T5 GLU constructor arguments
4385f7b6
Refactor samples dict creation
2d24b13b
Move callees under caller
bd461f5f
Handle empty context
35b2956a
Handle more possible model types
f0171e01
Fix fully truncated contexts with prefix tokens
92158d86
Make T5 GLU checks safer
3b7692f9
Improve import code style
b37d3ee1
Refactor dummy barriers
5959e89e
Refactor file name creation
ce8c1a5a
Allow packing only full documents
3e529661
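Full-document packing, as opposed to splitting documents across samples, can be sketched greedily (a rough illustration under my own assumptions, not the PR's indexing code): accumulate whole documents into a pack until the next one would overflow the sequence length.

```python
def pack_full_documents(doc_lengths, max_seq_len):
    """Greedily group documents into packs without splitting any document.

    A document longer than `max_seq_len` gets a pack of its own (it
    would be truncated downstream). Returns lists of document indices.
    """
    packs, current, used = [], [], 0
    for i, length in enumerate(doc_lengths):
        if current and used + length > max_seq_len:
            packs.append(current)
            current, used = [], 0
        current.append(i)
        used += length
    if current:
        packs.append(current)
    return packs

pack_full_documents([300, 200, 600, 100, 50], 512)  # [[0, 1], [2], [3, 4]]
```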
Use full-doc packing for T5-style datasets
23efa88b
Fix trying to all-reduce non-existent bias
88eb98ad
Fix truncating packed sequences without padding
59e84516
Speed up packed dataset indexing
24d46ff0
Try to exit padding removal early
600542da
Fix xPos embedding
58831d2b
Fix padding loss mask
fe45cea4
Handle failure mode regarding non-DS checkpoints
15e7b988
Fix decoder-only and no-mask-tokens seq lengths
ae45a9ec
Omit second objective token when not using mask tokens
0c91b960
Fix NumPy deprecations
0c246c46
Fix supplied arguments
7ce86350
Do not add separator if S-denoising
7290181c
Fix caching error
628d847b
Fix number of labels calculation for decoder-only
9c727e7b
Do not automatically add <EOS> token when packing
4ffa9519
Allow silently ignoring causal attention mask
ff5787ee