transformers
Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process
#32191
Merged

xenova merged 17 commits into main from fix-slow-tokenizer
xenova
xenova Remove user-defined tokens which can be obtained through merges
2b93e03f
xenova Remove debug line
7cafbfff
xenova formatting
65482075
amyeroberts
ArthurZucker commented on 2024-07-24
xenova requested a review from ArthurZucker 1 year ago
HuggingFaceDocBuilderDev
xenova Refactor spm slow -> fast converter
8cea780c
xenova revert unnecessary refactor
c9bc1b67
xenova set comprehension
f0156daf
xenova remove test files
49dbd699
xenova Use `vocab_scores`
5b880537
xenova Always replace spiece underline with space in decode
c31f7c77
xenova we no longer need token filtering
7174b487
xenova Add save fast load slow unit test
3851755f
xenova changed the title from "Remove user-defined tokens which can be obtained through merges" to "Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process" 1 year ago
ArthurZucker commented on 2024-07-26
xenova Remove tokenizers version check
f0f8103a
xenova requested a review from ArthurZucker 1 year ago
xenova Remove duplicate code
0e82f602
xenova Make `<start_of_turn>` and `<end_of_turn>` special tokens
cd1118cc
xenova Bias merge priority with length if score is the same
b79e6462
xenova Add unit test for merge priority
36c6fb1f
xenova requested a review from pcuenca 1 year ago
xenova
pcuenca commented on 2024-07-30
pcuenca approved these changes on 2024-07-30
ArthurZucker approved these changes on 2024-07-30
xenova
xenova CI
acfc821b
xenova merged 6e2d04e4 into main 1 year ago
xenova deleted the fix-slow-tokenizer branch 1 year ago
