Megatron-DeepSpeed
Add UL2 data sampling and pretraining
#358
Open

Commits
  • Fix `PretrainedFromHF` tokenizer with T5 training
    janEbert committed 3 years ago
  • Allow passing existing causal attention masks
    janEbert committed 3 years ago
  • Refactor masked LM sampling style selection
    janEbert committed 3 years ago
  • Add more masked LM sampling styles
    janEbert committed 3 years ago
  • Allow Prefix-LM style masked LM
    janEbert committed 3 years ago
  • Add UL2 pretraining for T5 model
    janEbert committed 3 years ago
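UL2 pretraining mixes several denoising objectives and prepends a mode token telling the model which one produced the sample. A minimal sketch of that mixture-of-denoisers sampling, with names and parameters taken from the UL2 paper (not necessarily this PR's defaults):

```python
import random

# R = regular span corruption, S = sequential (Prefix-LM), X = extreme
# denoising; the mode token is prepended to the input sequence.
DENOISERS = [
    ("[R]", {"mean_span": 3, "corruption_rate": 0.15}),
    ("[S]", {"prefix_lm": True}),
    ("[X]", {"mean_span": 32, "corruption_rate": 0.50}),
]

def sample_denoiser(rng=random):
    """Pick a denoiser configuration for the next sample.

    Uniform choice here for illustration; real mixtures may weight
    the denoisers differently.
    """
    return rng.choice(DENOISERS)

token, config = sample_denoiser(random.Random(0))
```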
  • Refactor span merging
    janEbert committed 3 years ago
  • Support UL2 for decoder-only models
    janEbert committed 3 years ago
  • Unconditionally use safe maximum sequence length
    janEbert committed 3 years ago
  • Add custom exceptions
    janEbert committed 3 years ago
  • Error out on too long sequences
    janEbert committed 3 years ago
  • Remove additional sequence truncation
    janEbert committed 3 years ago
  • Prefer array-from-list creation
    janEbert committed 3 years ago
  • Remove redundant imports
    janEbert committed 3 years ago
  • Fix not inserting prefixes
    janEbert committed 3 years ago
  • Do not insert `extra_id` tokens for PrefixLM task
    janEbert committed 3 years ago
  • Document `max_seq_length_dec` argument
    janEbert committed 3 years ago
  • Skip redundant computations
    janEbert committed 3 years ago
  • Fix PrefixLM mean location
    janEbert committed 3 years ago
  • Pad decoder-only inputs to same length
    janEbert committed 3 years ago
  • Fix decoder-only attention mask shape
    janEbert committed 3 years ago
  • Document index set selection for PrefixLM masking
    janEbert committed 3 years ago
  • Fix `max_ngrams` for normal sampling style
    janEbert committed 3 years ago
  • Do not limit `max_predictions_per_seq`
    janEbert committed 3 years ago
  • Calculate and use amount of filtered tokens
    janEbert committed 3 years ago
  • Document normal sampling style
    janEbert committed 3 years ago
  • Fix PrefixLM possible spans calculation
    janEbert committed 3 years ago
  • Use binary search for PrefixLM first tail index
    janEbert committed 3 years ago
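The binary-search commit above exploits that, with span starts sorted ascending, tail length (distance from a start to the end of the sequence) decreases monotonically, so the first start whose tail fits can be found with `bisect`. A hypothetical sketch (function and argument names are assumptions, not the PR's API):

```python
import bisect

def first_fitting_tail_index(span_starts, max_tail_len, seq_len):
    """Return the index of the first span start whose tail
    (seq_len - start) fits within max_tail_len.

    span_starts must be sorted ascending, so tail lengths are
    monotonically decreasing and binary search applies.
    """
    # Smallest start with seq_len - start <= max_tail_len,
    # i.e. start >= seq_len - max_tail_len.
    return bisect.bisect_left(span_starts, seq_len - max_tail_len)

# Example: spans starting at these positions in a 100-token sequence.
starts = [0, 10, 40, 70, 90]
idx = first_fitting_tail_index(starts, max_tail_len=35, seq_len=100)
# starts[idx] == 70 is the first start whose tail (30 tokens) fits.
```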
  • Calculate n-gram indices lazily
    janEbert committed 3 years ago
  • Fix code style
    janEbert committed 3 years ago
  • Prefer list comprehensions
    janEbert committed 3 years ago
  • Allow recognizing when UL2 is used
    janEbert committed 3 years ago
  • Support UL2 tokens for all tokenizers
    janEbert committed 3 years ago
  • Support `<extra_id>` tokens for GPT tokenizer
    janEbert committed 3 years ago
  • Fix tokenizer vocab access
    janEbert committed 3 years ago
  • Revert inheriting from `T5Dataset`
    janEbert committed 3 years ago
  • Fix GPT tokenizer special token handling
    janEbert committed 3 years ago
  • Do inherit from `torch.utils.data.Dataset`
    janEbert committed 3 years ago
  • Add whitespace
    janEbert committed 3 years ago
  • Allow selectively disabling denoiser token
    janEbert committed 3 years ago
  • Allow not replacing masks with sentinel tokens
    janEbert committed 3 years ago
  • Support not adding mask tokens in span corruption
    janEbert committed 3 years ago
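The commits around sentinel tokens concern T5-style span corruption: masked spans are replaced in the input by `<extra_id_k>` sentinels, and the targets pair each sentinel with the dropped tokens. A minimal sketch of both variants (with and without sentinels), using assumed names:

```python
def corrupt_spans(tokens, spans, use_sentinels=True):
    """Apply T5-style span corruption.

    Each (start, end) span is cut from `tokens`; with sentinels, a
    <extra_id_k> marker replaces it in the input and prefixes the
    dropped tokens in the target. With use_sentinels=False the spans
    are simply removed (the variant one commit above enables).
    """
    inputs, targets = [], []
    prev = 0
    for k, (start, end) in enumerate(spans):
        inputs.extend(tokens[prev:start])
        if use_sentinels:
            sentinel = f"<extra_id_{k}>"
            inputs.append(sentinel)
            targets.append(sentinel)
        targets.extend(tokens[start:end])
        prev = end
    inputs.extend(tokens[prev:])
    return inputs, targets

toks = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]
inp, tgt = corrupt_spans(toks, [(1, 3), (5, 6)])
# inp: ['the', '<extra_id_0>', 'fox', 'jumps', '<extra_id_1>', 'the', 'dog']
# tgt: ['<extra_id_0>', 'quick', 'brown', '<extra_id_1>', 'over']
```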
  • Fix expected number of added tokens
    janEbert committed 3 years ago
  • Fix non-masked data
    janEbert committed 3 years ago
  • Fix unclear wording
    janEbert committed 3 years ago
  • Adjust code style
    janEbert committed 3 years ago
  • Fix covered index skipping
    janEbert committed 3 years ago
  • Prepend objective token before truncating
    janEbert committed 3 years ago
  • Automatically truncate sequences for decoder-only
    janEbert committed 3 years ago
  • Fix covered span skipping fix
    janEbert committed 3 years ago
  • Make `build_index_mappings` public
    janEbert committed 3 years ago
  • Refactor getting sample
    janEbert committed 3 years ago
  • Add sample packing to T5 dataset
    janEbert committed 3 years ago
  • Add sample packing to UL2 dataset
    janEbert committed 3 years ago
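Sample packing, added to both datasets above, concatenates short samples into fixed-length rows so less of each batch is padding. A greedy sketch of the core idea (a simplification; the PR also handles denoiser prompt tokens and attention-mask boundaries between packed samples):

```python
def pack_samples(samples, max_seq_len, pad_id=0):
    """Greedily pack variable-length token lists into rows of
    max_seq_len tokens, padding only the remainder of each row."""
    rows, current = [], []
    for sample in samples:
        if len(current) + len(sample) > max_seq_len:
            # Close the current row, padding it out to full length.
            rows.append(current + [pad_id] * (max_seq_len - len(current)))
            current = []
        # Overly long samples are truncated for this sketch.
        current.extend(sample[:max_seq_len])
    if current:
        rows.append(current + [pad_id] * (max_seq_len - len(current)))
    return rows

rows = pack_samples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_len=6)
# [[1, 2, 3, 4, 5, 0], [6, 7, 8, 9, 0, 0]]
```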
  • Fix typo and comment placement
    janEbert committed 3 years ago
  • Fix not supplying `--pack-samples` argument
    janEbert committed 3 years ago
  • Add support for UL2R-style implementation
    janEbert committed 3 years ago
  • Fix T5 dataset packing
    janEbert committed 3 years ago
  • Refactor `get_sample` to return a list
    janEbert committed 3 years ago
  • Fix T5 sample packing
    janEbert committed 3 years ago
  • Fix UL2 sample packing
    janEbert committed 3 years ago
  • Refactor samples dict creation
    janEbert committed 3 years ago
  • Fix desired seq length
    janEbert committed 3 years ago
  • Fix padding removal
    janEbert committed 3 years ago
  • Allow repeating UL2 prompt token when packing
    janEbert committed 3 years ago
  • Allow packing different denoisers together
    janEbert committed 3 years ago
  • Refactor sample packing functions
    janEbert committed 3 years ago
  • Repeat prompt by default when packing UL2
    janEbert committed 3 years ago
  • Support pipelining for decoder-only model
    janEbert committed 3 years ago
  • Fix GPT tokenizer vocab size query
    janEbert committed 3 years ago
  • Handle possibly empty list
    janEbert committed 3 years ago
  • Fix no newline at EOF
    janEbert committed 3 years ago
  • Allow full prefix Prefix-LM attention sampling
    janEbert committed 3 years ago
  • Support PrefixLM models
    janEbert committed 3 years ago
  • Allow setting number of few-shot examples
    janEbert committed 3 years ago
  • Update task/dataset name
    janEbert committed 3 years ago
  • Do not remove last token
    janEbert committed 3 years ago
  • Fix PrefixLM contexts
    janEbert committed 3 years ago
  • Fix module refactor
    janEbert committed 3 years ago
  • Fix possible `TypeError`
    janEbert committed 3 years ago
  • Optionally add prefix tokens
    janEbert committed 3 years ago
  • Automatically add UL2 tokens
    janEbert committed 3 years ago
  • Fix context lengths batch chunking
    janEbert committed 3 years ago
  • Allow different models to be loaded
    janEbert committed 3 years ago
  • Fix context batch size padding
    janEbert committed 3 years ago
  • Add xPos embeddings
    janEbert committed 3 years ago
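xPos extends rotary position embeddings with a per-dimension exponential decay that damps long-range rotations and improves length extrapolation. A sketch of the scale factors following the xPos paper's formula; the constants and the position centering are illustrative assumptions, not necessarily this PR's values:

```python
import numpy as np

def xpos_scale(seq_len, head_dim, scale_base=512, gamma=0.4):
    """Per-position, per-dimension xPos decay factors.

    Rotary embeddings are multiplied by scale**(n/scale_base) on
    queries and scale**(-n/scale_base) on keys, so the product decays
    with relative distance.
    """
    i = np.arange(0, head_dim, 2)
    # Per-dimension-pair base in (0, 1], as in the xPos paper.
    zeta = (i / head_dim + gamma) / (1.0 + gamma)
    # Centered positions keep the scales numerically moderate.
    n = np.arange(seq_len)[:, None] - seq_len // 2
    return zeta[None, :] ** (n / scale_base)  # (seq_len, head_dim // 2)

scales = xpos_scale(seq_len=8, head_dim=16)
```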
  • Add optional UL2 normal distribution scaling
    janEbert committed 3 years ago
  • Allow evaluating encoder-decoder models
    janEbert committed 3 years ago
  • Fix not passing `scale_normal_std`
    janEbert committed 3 years ago
  • Add T5-style GLU layers
    janEbert committed 3 years ago
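T5 v1.1's gated MLP (GEGLU) uses two parallel input projections, one passed through GELU as a gate, followed by an elementwise product and the output projection, all without biases. A NumPy sketch of the computation (shapes and names assumed for illustration):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as commonly used in these codebases.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def t5_glu_mlp(x, w_gate, w_up, w_down):
    """T5-style gated MLP: gelu(x W_gate) * (x W_up), then W_down.

    No bias terms, matching the commit below that drops the 2nd MLP
    layer's bias when T5 GLU is enabled.
    """
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal((4, d_model))
out = t5_glu_mlp(
    x,
    rng.standard_normal((d_model, d_ff)),
    rng.standard_normal((d_model, d_ff)),
    rng.standard_normal((d_ff, d_model)),
)
# out has the same shape as x: (4, d_model).
```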
  • Rename xPos embedding class
    janEbert committed 3 years ago
  • Integrate xPos embedding
    janEbert committed 3 years ago
  • Handle xPos embedding
    janEbert committed 3 years ago
  • Do not use bias for 2nd MLP layer if using T5 GLU
    janEbert committed 3 years ago
  • Fix T5 GLU constructor arguments
    janEbert committed 3 years ago
  • Refactor samples dict creation
    janEbert committed 3 years ago
  • Move callees under caller
    janEbert committed 3 years ago
  • Handle empty context
    janEbert committed 3 years ago
  • Handle more possible model types
    janEbert committed 3 years ago
  • Fix fully truncated contexts with prefix tokens
    janEbert committed 3 years ago
  • + more commits ...