Add LRU cache, add faster tokenization (#37)
* Update gpt2_tokenization.py
Adding an LRU cache and speeding up tokenization.
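The core idea, as a minimal sketch (hypothetical names, not the exact code in megatron/tokenizer/gpt2_tokenization.py): memoize the BPE merge result per token with a bounded LRU cache, so repeated words skip the merge loop entirely.

```python
# Sketch only: illustrates an LRU-cached BPE step, not the PR's actual implementation.
from functools import lru_cache

class SimpleBPE:
    def __init__(self, merges):
        # merges: dict mapping a symbol pair to its merge rank (lower = applied earlier)
        self.bpe_ranks = merges

    @lru_cache(maxsize=65536)  # bounded cache keeps memory use in check
    def bpe(self, token):
        word = tuple(token)
        while len(word) > 1:
            # pick the adjacent pair with the lowest merge rank
            pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
            best = min(pairs, key=lambda p: self.bpe_ranks.get(p, float("inf")))
            if best not in self.bpe_ranks:
                break
            # merge every occurrence of the best pair
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            word = tuple(merged)
        return " ".join(word)
```

With natural-language input, a small set of frequent tokens dominates, so the cache hit rate is high and most calls return in constant time.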
* Update gpt2_tokenization.py
Removing the _old method. Note that the Chinese token processing is optional and not currently used in training.
* Update gpt2_tokenization.py
* Update preprocess_data.py
The path needs to be set before we can find the "megatron" package.
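For context, the fix is along these lines (a sketch, not the exact diff): put the repository root on sys.path before the megatron import runs, otherwise the import fails when the script is invoked from outside the repo root.

```python
# Sketch of the ordering issue in preprocess_data.py:
# the repo root must be on sys.path *before* "megatron" is imported.
import os
import sys

sys.path.append(os.path.abspath(
    os.path.join(os.path.dirname(__file__), os.path.pardir)))

from megatron.tokenizer import build_tokenizer  # noqa: E402  (import after path setup)
```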
* Update gpt2_tokenization.py
Adding comments about max_token_len_cache.
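A hedged illustration of what a length cap on the cache can look like (the names and the cap value below are assumptions; the authoritative description is in the comments added to gpt2_tokenization.py): only cache BPE results for tokens up to a maximum length, so rare, very long tokens don't bloat the cache or evict frequent short ones.

```python
# Hypothetical sketch of a length-capped token cache; not the exact PR code.
MAX_TOKEN_LEN_CACHE = 128  # assumed cap; the real value/meaning is documented in the code

_bpe_cache = {}

def bpe_cached(token, bpe_fn):
    """Return bpe_fn(token), caching only tokens short enough to be worth storing."""
    if len(token) > MAX_TOKEN_LEN_CACHE:
        return bpe_fn(token)  # too long: compute without caching
    result = _bpe_cache.get(token)
    if result is None:
        result = _bpe_cache[token] = bpe_fn(token)
    return result
```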
* Update megatron/tokenizer/gpt2_tokenization.py
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update gpt2_tokenization.py
* Update megatron/tokenizer/gpt2_tokenization.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update gpt2_tokenization.py
* Update gpt2_tokenization.py
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>