llama.cpp
Improve BERT tokenization for accented characters and non-latin scripts
#5740

Merged

ggerganov merged 5 commits into ggml-org:master from iamlemec:tokenizer-fix

implement nfd for stripping accents in wpm tokenizer

9c996e3d

cebtenzzre commented on 2024-02-26

sort nfd map; reuse iterator

2dd36d6d
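
The two commits above describe stripping accents through an NFD mapping and keeping that map sorted. As a rough illustration of the idea only, and not the code merged in this PR, here is a minimal sketch that looks up a code point in a sorted table with binary search and returns its unaccented base. The table `kNfdBase`, its five entries, and the helpers `strip_accent` / `strip_accents` are hypothetical names chosen for this example. The sketch also simplifies by mapping directly to the base character, whereas full NFD yields a base character plus combining marks that are then dropped.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical excerpt of an NFD-style table: (precomposed code point, base code point).
// Entries must stay sorted by the first element so binary search is valid.
static const std::pair<uint32_t, uint32_t> kNfdBase[] = {
    {0x00C9, 0x0045}, // É -> E
    {0x00E0, 0x0061}, // à -> a
    {0x00E9, 0x0065}, // é -> e
    {0x00F1, 0x006E}, // ñ -> n
    {0x00FC, 0x0075}, // ü -> u
};

// Returns the base code point if cp is in the table, otherwise cp unchanged.
uint32_t strip_accent(uint32_t cp) {
    const auto * first = std::begin(kNfdBase);
    const auto * last  = std::end(kNfdBase);
    // Binary search on the sorted keys; the iterator is reused for the match check.
    const auto * it = std::lower_bound(
        first, last, cp,
        [](const std::pair<uint32_t, uint32_t> & entry, uint32_t value) {
            return entry.first < value;
        });
    if (it != last && it->first == cp) {
        return it->second;
    }
    return cp;
}

// Applies strip_accent to every code point of an already-decoded string.
std::vector<uint32_t> strip_accents(const std::vector<uint32_t> & cps) {
    std::vector<uint32_t> out;
    out.reserve(cps.size());
    for (uint32_t cp : cps) {
        out.push_back(strip_accent(cp));
    }
    return out;
}
```

A sorted array with `std::lower_bound` gives logarithmic lookups without the per-node allocations of a tree-based map, which is one plausible reason to sort such a table. The input here is assumed to be decoded to code points already, so UTF-8 decoding is out of scope for the sketch.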

ggerganov approved these changes on 2024-02-27

ggerganov requested a review from cebtenzzre 2 years ago

cebtenzzre approved these changes on 2024-02-27

use builtin tolower

6b33a094

cebtenzzre commented on 2024-02-27

add locale include

9242cf14

cebtenzzre commented on 2024-02-27

Simplify to_lower cases

801abe52

cebtenzzre approved these changes on 2024-02-27

ggerganov merged 177628bf into master 2 years ago

cebtenzzre commented on 2024-03-25

Reviewers

ggerganov

cebtenzzre
