Easily train a new fast tokenizer from a given one (#12361)
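For context, a minimal sketch of the API this PR adds: `train_new_from_iterator` retrains a fast tokenizer on a new corpus while keeping the original pipeline (normalizer, pre-tokenizer, model type, special tokens). The checkpoint, corpus, and vocab size below are illustrative.

```python
from transformers import AutoTokenizer

# Start from any existing *fast* tokenizer.
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Any iterator over batches of raw text works; this toy corpus is illustrative.
corpus = ["New text to learn a vocabulary from.", "Another example sentence."]

def batch_iterator(batch_size=2):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Train a brand-new tokenizer with the same pipeline as the old one.
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=1000)
new_tokenizer.save_pretrained("my-new-tokenizer")
```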
* [WIP] Easily train a new fast tokenizer from a given one
* Fix test
* Roll out to other tokenizers and add tests
* Fix bug with unk id and add emoji to test
* Really use something different in test
* Implement special tokens map
* Map special tokens in the Transformers tokenizers (see the sketch after this commit list)
* Fix test
* Make test more robust
* Fix test for BPE
* More robust map and test
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
* Test file
* Stronger tests
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
* Map unk token for WordPiece and address review comment
* Fix lowercase test and address review comment
* Fix all tests
* Simplify test
* Fix tests for realsies
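Several commits above concern mapping special tokens. A hedged sketch of how that surfaces in the merged API, assuming the `special_tokens_map` argument maps an old special token string to its replacement (reusing `old_tokenizer` and `batch_iterator` from the sketch above):

```python
# Sketch: rename special tokens while retraining; assumes the mapping
# runs old token string -> new token string.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=1000,
    special_tokens_map={"[UNK]": "<unk>", "[MASK]": "<mask>"},
)
assert new_tokenizer.unk_token == "<unk>"  # the unk token is mapped too
```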
* Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) (#12420)
* Propose change in tests regarding lower case
* add new test for special tokens types
* put back the test part about decoding
* add feature: the AddedToken is rebuilt with the mapped content (see the sketch at the end of this log)
* Address review comment: simplify AddedToken building
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
* Update src/transformers/tokenization_utils_fast.py
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
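Finally, a sketch of the special-tokens format handling tackled in #12420: new special tokens can be passed as plain strings or as `tokenizers.AddedToken` objects, and an AddedToken is rebuilt with its mapped content so options like `lstrip` carry over. Again reusing `old_tokenizer` and `batch_iterator` from the first sketch:

```python
# Sketch: extra special tokens as str or AddedToken (the AddedToken's
# stripping options are preserved in the trained tokenizer).
from tokenizers import AddedToken

new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=1000,
    new_special_tokens=["<extra_0>", AddedToken("<mask2>", lstrip=True)],
)
```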