Improve BERT tokenization for accented characters and non-latin scripts #5740
implement nfd for stripping accents in wpm tokenizer
9c996e3d
sort nfd map; reuse iterator
2dd36d6d
ggerganov
approved these changes
on 2024-02-27
use builtin tolower
6b33a094
add locale include
9242cf14
Simplify to_lower cases
801abe52
ggerganov
merged
177628bf
into master 1 year ago