Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process #32191
Remove user-defined tokens which can be obtained through merges
2b93e03f
Remove debug line
7cafbfff
formatting
65482075
Refactor spm slow -> fast converter
8cea780c
revert unnecessary refactor
c9bc1b67
set comprehension
f0156daf
remove test files
49dbd699
Use `vocab_scores`
5b880537
Always replace spiece underline with space in decode
c31f7c77
we no longer need token filtering
7174b487
Add save fast load slow unit test
3851755f
xenova
changed the title from "Remove user-defined tokens which can be obtained through merges" to "Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process" 1 year ago
Remove tokenizers version check
f0f8103a
Remove duplicate code
0e82f602
Make `<start_of_turn>` and `<end_of_turn>` special tokens
cd1118cc
Bias merge priority with length if score is the same
b79e6462
Add unit test for merge priority
36c6fb1f
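The "Bias merge priority with length if score is the same" commit above describes a tie-breaking rule for ordering merges. The sketch below is one plausible reading of that rule, not the code from this PR: the function name `order_merges`, the shape of `vocab_scores` (a mapping from piece to score), and the choice that longer pieces win ties are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def order_merges(vocab_scores: Dict[str, float]) -> List[Tuple[str, str]]:
    """Hypothetical helper: derive merges from a piece -> score mapping.

    A merge (left, right) is a candidate when left, right, and left + right
    are all pieces in the vocabulary. Candidates are ordered by the score of
    the merged piece (higher first); ties are broken by the length of the
    left piece, then the right piece (longer first, an assumption).
    """
    candidates = []
    for piece, score in vocab_scores.items():
        # Try every split point of the piece into two non-empty halves.
        for i in range(1, len(piece)):
            left, right = piece[:i], piece[i:]
            if left in vocab_scores and right in vocab_scores:
                candidates.append((left, right, score))

    # Primary key: score. Secondary keys: lengths, so equal-score merges
    # are ordered deterministically instead of by insertion order alone.
    candidates.sort(key=lambda m: (m[2], len(m[0]), len(m[1])), reverse=True)
    return [(left, right) for left, right, _ in candidates]


if __name__ == "__main__":
    scores = {"a": 0.0, "b": 0.0, "c": 0.0, "ab": -1.0, "bc": -1.0, "abc": -1.0}
    for merge in order_merges(scores):
        print(merge)
```

With all merged pieces sharing one score, only the length keys decide the order, which is the situation the unit test commit appears to target.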
pcuenca
approved these changes
on 2024-07-30
CI
acfc821b
xenova
merged
6e2d04e4
into main 1 year ago
xenova
deleted the fix-slow-tokenizer branch 1 year ago