Megatron-DeepSpeed
Add UL2 data sampling and pretraining
#358
Open

Commits
  • Fix `PretrainedFromHF` tokenizer with T5 training
    janEbert committed 3 years ago
  • Allow passing existing causal attention masks
    janEbert committed 3 years ago
  • Refactor masked LM sampling style selection
    janEbert committed 3 years ago
  • Add more masked LM sampling styles
    janEbert committed 3 years ago
  • Allow Prefix-LM style masked LM
    janEbert committed 3 years ago
  • Add UL2 pretraining for T5 model
    janEbert committed 3 years ago
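UL2 pretraining mixes several denoising objectives and prepends a mode token telling the model which one produced the sample. A minimal sketch of that mixture-of-denoisers sampling, with names and parameters taken from the UL2 paper (not necessarily this PR's defaults):

```python
import random

# R = regular span corruption, S = sequential (Prefix-LM), X = extreme
# denoising; the mode token is prepended to the input sequence.
DENOISERS = [
    ("[R]", {"mean_span": 3, "corruption_rate": 0.15}),
    ("[S]", {"prefix_lm": True}),
    ("[X]", {"mean_span": 32, "corruption_rate": 0.50}),
]

def sample_denoiser(rng=random):
    """Pick a denoiser configuration for the next sample.

    Uniform choice here for illustration; real mixtures may weight
    the denoisers differently.
    """
    return rng.choice(DENOISERS)

token, config = sample_denoiser(random.Random(0))
```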
  • Refactor span merging
    janEbert committed 3 years ago
  • Support UL2 for decoder-only models
    janEbert committed 3 years ago
  • Unconditionally use safe maximum sequence length
    janEbert committed 3 years ago
  • Add custom exceptions
    janEbert committed 3 years ago
  • Error out on too long sequences
    janEbert committed 3 years ago
  • Remove additional sequence truncation
    janEbert committed 3 years ago
  • Prefer array-from-list creation
    janEbert committed 3 years ago
  • Remove redundant imports
    janEbert committed 3 years ago
  • Fix not inserting prefixes
    janEbert committed 3 years ago
  • Do not insert `extra_id` tokens for PrefixLM task
    janEbert committed 3 years ago
  • Document `max_seq_length_dec` argument
    janEbert committed 3 years ago
  • Skip redundant computations
    janEbert committed 3 years ago
  • Fix PrefixLM mean location
    janEbert committed 3 years ago
  • Pad decoder-only inputs to same length
    janEbert committed 3 years ago
  • Fix decoder-only attention mask shape
    janEbert committed 3 years ago
  • Document index set selection for PrefixLM masking
    janEbert committed 3 years ago
  • Fix `max_ngrams` for normal sampling style
    janEbert committed 3 years ago
  • Do not limit `max_predictions_per_seq`
    janEbert committed 3 years ago
  • Calculate and use amount of filtered tokens
    janEbert committed 3 years ago
  • Document normal sampling style
    janEbert committed 3 years ago
  • Fix PrefixLM possible spans calculation
    janEbert committed 3 years ago
  • Use binary search for PrefixLM first tail index
    janEbert committed 3 years ago
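The binary-search commit above exploits that, with span starts sorted ascending, tail length (distance from a start to the end of the sequence) decreases monotonically, so the first start whose tail fits can be found with `bisect`. A hypothetical sketch (function and argument names are assumptions, not the PR's API):

```python
import bisect

def first_fitting_tail_index(span_starts, max_tail_len, seq_len):
    """Return the index of the first span start whose tail
    (seq_len - start) fits within max_tail_len.

    span_starts must be sorted ascending, so tail lengths are
    monotonically decreasing and binary search applies.
    """
    # Smallest start with seq_len - start <= max_tail_len,
    # i.e. start >= seq_len - max_tail_len.
    return bisect.bisect_left(span_starts, seq_len - max_tail_len)

# Example: spans starting at these positions in a 100-token sequence.
starts = [0, 10, 40, 70, 90]
idx = first_fitting_tail_index(starts, max_tail_len=35, seq_len=100)
# starts[idx] == 70 is the first start whose tail (30 tokens) fits.
```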
  • Calculate n-gram indices lazily
    janEbert committed 3 years ago
  • Fix code style
    janEbert committed 3 years ago
  • Prefer list comprehensions
    janEbert committed 3 years ago
  • Allow recognizing when UL2 is used
    janEbert committed 3 years ago
  • Support UL2 tokens for all tokenizers
    janEbert committed 3 years ago
  • Support `<extra_id>` tokens for GPT tokenizer
    janEbert committed 3 years ago
  • Fix tokenizer vocab access
    janEbert committed 3 years ago
  • Revert inheriting from `T5Dataset`
    janEbert committed 3 years ago
  • Fix GPT tokenizer special token handling
    janEbert committed 3 years ago
  • Do inherit from `torch.utils.data.Dataset`
    janEbert committed 3 years ago
  • Add whitespace
    janEbert committed 3 years ago
  • Allow selectively disabling denoiser token
    janEbert committed 3 years ago
  • Allow not replacing masks with sentinel tokens
    janEbert committed 3 years ago
  • Support not adding mask tokens in span corruption
    janEbert committed 3 years ago
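The commits around sentinel tokens concern T5-style span corruption: masked spans are replaced in the input by `<extra_id_k>` sentinels, and the targets pair each sentinel with the dropped tokens. A minimal sketch of both variants (with and without sentinels), using assumed names:

```python
def corrupt_spans(tokens, spans, use_sentinels=True):
    """Apply T5-style span corruption.

    Each (start, end) span is cut from `tokens`; with sentinels, a
    <extra_id_k> marker replaces it in the input and prefixes the
    dropped tokens in the target. With use_sentinels=False the spans
    are simply removed (the variant one commit above enables).
    """
    inputs, targets = [], []
    prev = 0
    for k, (start, end) in enumerate(spans):
        inputs.extend(tokens[prev:start])
        if use_sentinels:
            sentinel = f"<extra_id_{k}>"
            inputs.append(sentinel)
            targets.append(sentinel)
        targets.extend(tokens[start:end])
        prev = end
    inputs.extend(tokens[prev:])
    return inputs, targets

toks = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]
inp, tgt = corrupt_spans(toks, [(1, 3), (5, 6)])
# inp: ['the', '<extra_id_0>', 'fox', 'jumps', '<extra_id_1>', 'the', 'dog']
# tgt: ['<extra_id_0>', 'quick', 'brown', '<extra_id_1>', 'over']
```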
  • Fix expected number of added tokens
    janEbert committed 3 years ago
  • Fix non-masked data
    janEbert committed 3 years ago
  • Fix unclear wording
    janEbert committed 3 years ago
  • Adjust code style
    janEbert committed 3 years ago
  • Fix covered index skipping
    janEbert committed 3 years ago
  • Prepend objective token before truncating
    janEbert committed 3 years ago
  • Automatically truncate sequences for decoder-only
    janEbert committed 3 years ago
  • Fix covered span skipping fix
    janEbert committed 3 years ago
  • Make `build_index_mappings` public
    janEbert committed 3 years ago
  • Refactor getting sample
    janEbert committed 3 years ago
  • Add sample packing to T5 dataset
    janEbert committed 3 years ago
  • Add sample packing to UL2 dataset
    janEbert committed 3 years ago
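Sample packing, added to both datasets above, concatenates short samples into fixed-length rows so less of each batch is padding. A greedy sketch of the core idea (a simplification; the PR also handles denoiser prompt tokens and attention-mask boundaries between packed samples):

```python
def pack_samples(samples, max_seq_len, pad_id=0):
    """Greedily pack variable-length token lists into rows of
    max_seq_len tokens, padding only the remainder of each row."""
    rows, current = [], []
    for sample in samples:
        if len(current) + len(sample) > max_seq_len:
            # Close the current row, padding it out to full length.
            rows.append(current + [pad_id] * (max_seq_len - len(current)))
            current = []
        # Overly long samples are truncated for this sketch.
        current.extend(sample[:max_seq_len])
    if current:
        rows.append(current + [pad_id] * (max_seq_len - len(current)))
    return rows

rows = pack_samples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_len=6)
# [[1, 2, 3, 4, 5, 0], [6, 7, 8, 9, 0, 0]]
```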
  • Fix typo and comment placement
    janEbert committed 3 years ago
  • Fix not supplying `--pack-samples` argument
    janEbert committed 3 years ago
  • Add support for UL2R-style implementation
    janEbert committed 3 years ago
  • Fix T5 dataset packing
    janEbert committed 3 years ago
  • Refactor `get_sample` to return a list
    janEbert committed 3 years ago
  • Fix T5 sample packing
    janEbert committed 3 years ago
  • Fix UL2 sample packing
    janEbert committed 3 years ago
  • Refactor samples dict creation
    janEbert committed 3 years ago
  • Fix desired seq length
    janEbert committed 3 years ago
  • Fix padding removal
    janEbert committed 3 years ago
  • Allow repeating UL2 prompt token when packing
    janEbert committed 3 years ago
  • Allow packing different denoisers together
    janEbert committed 3 years ago
  • Refactor sample packing functions
    janEbert committed 3 years ago
  • Repeat prompt by default when packing UL2
    janEbert committed 3 years ago
  • Support pipelining for decoder-only model
    janEbert committed 3 years ago
  • Fix GPT tokenizer vocab size query
    janEbert committed 3 years ago
  • Handle possibly empty list
    janEbert committed 3 years ago
  • Fix no newline at EOF
    janEbert committed 3 years ago
  • Allow full prefix Prefix-LM attention sampling
    janEbert committed 3 years ago
  • Support PrefixLM models
    janEbert committed 3 years ago
  • Allow setting number of few-shot examples
    janEbert committed 3 years ago
  • Update task/dataset name
    janEbert committed 3 years ago
  • Do not remove last token
    janEbert committed 3 years ago
  • Fix PrefixLM contexts
    janEbert committed 3 years ago
  • Fix module refactor
    janEbert committed 3 years ago
  • Fix possible `TypeError`
    janEbert committed 3 years ago
  • Optionally add prefix tokens
    janEbert committed 3 years ago
  • Automatically add UL2 tokens
    janEbert committed 3 years ago
  • Fix context lengths batch chunking
    janEbert committed 3 years ago
  • Allow different models to be loaded
    janEbert committed 3 years ago
  • Fix context batch size padding
    janEbert committed 3 years ago
  • Add xPos embeddings
    janEbert committed 3 years ago
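xPos extends rotary position embeddings with a per-dimension exponential decay that damps long-range rotations and improves length extrapolation. A sketch of the scale factors following the xPos paper's formula; the constants and the position centering are illustrative assumptions, not necessarily this PR's values:

```python
import numpy as np

def xpos_scale(seq_len, head_dim, scale_base=512, gamma=0.4):
    """Per-position, per-dimension xPos decay factors.

    Rotary embeddings are multiplied by scale**(n/scale_base) on
    queries and scale**(-n/scale_base) on keys, so the product decays
    with relative distance.
    """
    i = np.arange(0, head_dim, 2)
    # Per-dimension-pair base in (0, 1], as in the xPos paper.
    zeta = (i / head_dim + gamma) / (1.0 + gamma)
    # Centered positions keep the scales numerically moderate.
    n = np.arange(seq_len)[:, None] - seq_len // 2
    return zeta[None, :] ** (n / scale_base)  # (seq_len, head_dim // 2)

scales = xpos_scale(seq_len=8, head_dim=16)
```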
  • Add optional UL2 normal distribution scaling
    janEbert committed 3 years ago
  • Allow evaluating encoder-decoder models
    janEbert committed 3 years ago
  • Fix not passing `scale_normal_std`
    janEbert committed 3 years ago
  • Add T5-style GLU layers
    janEbert committed 3 years ago
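T5 v1.1's gated MLP (GEGLU) uses two parallel input projections, one passed through GELU as a gate, followed by an elementwise product and the output projection, all without biases. A NumPy sketch of the computation (shapes and names assumed for illustration):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as commonly used in these codebases.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def t5_glu_mlp(x, w_gate, w_up, w_down):
    """T5-style gated MLP: gelu(x W_gate) * (x W_up), then W_down.

    No bias terms, matching the commit below that drops the 2nd MLP
    layer's bias when T5 GLU is enabled.
    """
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal((4, d_model))
out = t5_glu_mlp(
    x,
    rng.standard_normal((d_model, d_ff)),
    rng.standard_normal((d_model, d_ff)),
    rng.standard_normal((d_ff, d_model)),
)
# out has the same shape as x: (4, d_model).
```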
  • Rename xPos embedding class
    janEbert committed 3 years ago
  • Integrate xPos embedding
    janEbert committed 3 years ago
  • Handle xPos embedding
    janEbert committed 3 years ago
  • Do not use bias for 2nd MLP layer if using T5 GLU
    janEbert committed 3 years ago
  • Fix T5 GLU constructor arguments
    janEbert committed 3 years ago
  • Refactor samples dict creation
    janEbert committed 3 years ago
  • Move callees under caller
    janEbert committed 3 years ago
  • Handle empty context
    janEbert committed 3 years ago
  • Handle more possible model types
    janEbert committed 3 years ago
  • Fix fully truncated contexts with prefix tokens
    janEbert committed 3 years ago
  • + more commits ...