llama.cpp
Improve BERT tokenization for accented characters and non-latin scripts
#5740
Merged

ggerganov merged 5 commits into ggml-org:master from iamlemec:tokenizer-fix
iamlemec implement nfd for stripping accents in wpm tokenizer
9c996e3d
cebtenzzre commented on 2024-02-26
iamlemec sort nfd map; reuse iterator
2dd36d6d
ggerganov approved these changes on 2024-02-27
ggerganov requested a review from cebtenzzre 1 year ago
cebtenzzre approved these changes on 2024-02-27
iamlemec use builtin tolower
6b33a094
cebtenzzre commented on 2024-02-27
iamlemec add locale include
9242cf14
cebtenzzre commented on 2024-02-27
iamlemec Simplify to_lower cases
801abe52
cebtenzzre approved these changes on 2024-02-27
ggerganov merged 177628bf into master 1 year ago
cebtenzzre commented on 2024-03-25
