Add LRU cache, add faster tokenization (#37)
* Update gpt2_tokenization.py
Adding an LRU cache and speeding up tokenization.
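The core idea, as a minimal sketch (hypothetical names, not the exact code in megatron/tokenizer/gpt2_tokenization.py): memoize the BPE merge result per token with a bounded LRU cache, so repeated words skip the merge loop entirely.

```python
# Sketch only: illustrates an LRU-cached BPE step, not the PR's actual implementation.
from functools import lru_cache

class SimpleBPE:
    def __init__(self, merges):
        # merges: dict mapping a symbol pair to its merge rank (lower = applied earlier)
        self.bpe_ranks = merges

    @lru_cache(maxsize=65536)  # bounded cache keeps memory use in check
    def bpe(self, token):
        word = tuple(token)
        while len(word) > 1:
            # pick the adjacent pair with the lowest merge rank
            pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
            best = min(pairs, key=lambda p: self.bpe_ranks.get(p, float("inf")))
            if best not in self.bpe_ranks:
                break
            # merge every occurrence of the best pair
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            word = tuple(merged)
        return " ".join(word)
```

With natural-language input, a small set of frequent tokens dominates, so the cache hit rate is high and most calls return in constant time.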
* Update gpt2_tokenization.py
Removing the _old method. Note that the Chinese token processing is optional and not currently used in training.
* Update gpt2_tokenization.py
* Update preprocess_data.py
The path needs to be set before we can find the "megatron" package.
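For context, the fix is along these lines (a sketch, not the exact diff): put the repository root on sys.path before the megatron import runs, otherwise the import fails when the script is invoked from outside the repo root.

```python
# Sketch of the ordering issue in preprocess_data.py:
# the repo root must be on sys.path *before* "megatron" is imported.
import os
import sys

sys.path.append(os.path.abspath(
    os.path.join(os.path.dirname(__file__), os.path.pardir)))

from megatron.tokenizer import build_tokenizer  # noqa: E402  (import after path setup)
```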
* Update gpt2_tokenization.py
Adding comments about max_token_len_cache.
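A hedged illustration of what a length cap on the cache can look like (the names and the cap value below are assumptions; the authoritative description is in the comments added to gpt2_tokenization.py): only cache BPE results for tokens up to a maximum length, so rare, very long tokens don't bloat the cache or evict frequent short ones.

```python
# Hypothetical sketch of a length-capped token cache; not the exact PR code.
MAX_TOKEN_LEN_CACHE = 128  # assumed cap; the real value/meaning is documented in the code

_bpe_cache = {}

def bpe_cached(token, bpe_fn):
    """Return bpe_fn(token), caching only tokens short enough to be worth storing."""
    if len(token) > MAX_TOKEN_LEN_CACHE:
        return bpe_fn(token)  # too long: compute without caching
    result = _bpe_cache.get(token)
    if result is None:
        result = _bpe_cache[token] = bpe_fn(token)
    return result
```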
* Update megatron/tokenizer/gpt2_tokenization.py
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update gpt2_tokenization.py
* Update megatron/tokenizer/gpt2_tokenization.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update gpt2_tokenization.py
* Update gpt2_tokenization.py
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>