Refactor-tokenization-more (#42563)
* One commit to bind them all
* nits
* small update
* elif
* super small nit
* BPE!
* fix
* up up up
* fix?
* one typo
* per model updates
* more model specific updates
* more per model updates
* more model specific updates
* simplify default merges
* fixup
* update
* update
* style
* fix
* fix colpali
* nits
* simpler regex + big shitty bird
* fixup and fix
* fix codellama
* up
* fix pop on none
* fix parakeet
* fix llama
* big fixup
* fix markuplm
* update common
* fix mbart
* fix seamlessm4T
* fix comment
* torch tests
* nits and revert UNK idx change
* oh only one deberta
* torch tests
* add convert from spm per model!
* fix last 2 for pegasus
* fix torch tests
* fixes
* fix tests
* check versioned files
* fix processor auto test
* fix custom tok clip
* try this fix
* modeling rag
* fix rag
* roformer the Tokenizers way
* up
* update
* fix unk
* update
* fix roberta
* if there is no mapped class and no tokenizer.json it's fucked -> just have the mapped class ready!
* fix the rest
* fix copies
* fix doc and copies
* fix mbart50
* fix deberta_v2 test
* fix and simplify whisper :)
* fix big bird: default was wrong
* fix final
* fixup
* small nit
* a weird way to fix fuyu?
* default xlm roberta to fix kosmos behaviour!
* remove small errors
* last fix?
* fix pixtral
* style
* fix
* quality ta radce
* fix?
* remove something
* remove one piece of code that should not have been there!
* fix ?
* fixup
* update
* fix for custom code
* add a custom model path to make sure custom stuff is registered
* fix trust remote code
* exceeded
* don't
* ouppsy for cohere
* why is this one also affected
* fixup
* fixup
* nits
* fix idefics3 tests
* okay read the processor
* fix the layout.... models
* nits
* codellama needs the bos passed
* fix dpr
* fix?
* fixup
* distilbert defaults
* fix
* clvp update to PythonTokenizer
* bloom
* style
* layoutxlm
* style
* olmo
* only pop when we don't convert from tokenizer.json
* fixup
* hub issue
* id
* fix
---------
Co-authored-by: itazap <ita.zaporozhets@huggingface.co>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-45.ec2.internal>