llama.cpp
Tokenizer BPE fixes
#7530
Merged

jaime-m-p
Update random test: add_bos_token
e013b231
Add BPE models for testing
55e387b2
bugfix: custom regex split fails with codepoint 0
fe3c5319
Refactor llm_tokenizer_bpe: move code to constructor
6f4c300b
Update random test: add_eos_token
614d0bb8
Add BPE models for testing
61683991
Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
0794b777
Fix falcon punctuation regex
51e933a9
github-actions added testing
github-actions added python
github-actions
mofosyne added Review Complexity : Medium
mofosyne
bartowski1182
jaime-m-p
jaime-m-p
Better name functions to append token/bos/eos
1d2f3ad4
Move tokenizer flags to vocab structure.
c83ea1a1
Allow lstrip for 'added_tokens'
615f425a
bartowski1182
teleprint-me
teleprint-me commented on 2024-05-25
teleprint-me
teleprint-me commented on 2024-05-25
teleprint-me
teleprint-me commented on 2024-05-25
Default values for special_add_bos/eos
f84b04f1
teleprint-me
teleprint-me commented on 2024-05-25
teleprint-me
teleprint-me commented on 2024-05-25
teleprint-me
jaime-m-p
Fix default value for WPM special_add_eos
7a5578f2
ggerganov
ggerganov commented on 2024-05-26
ggerganov
ggerganov commented on 2024-05-26
Better variable names
173ab69d
ggerganov
jaime-m-p
Build vocab.special_tokens_cache using vocab token types
fef99155
jaime-m-p
ggerganov
jaime-m-p
ggerganov
Merge commit '148995e5' into tokenizer-bpe-fixes
d67de1a3
Generalize 'jina-v2' per token attributes
c863752c
Fix merge: 'smaug'
75840fe6
update brute force random test
f58de317
Fix 'jina-v2' per token attributes
974d40b5
Fix unicode whitespaces (deepseek-coder)
07530a8d
Fix unicode whitespaces (deepseek-llm)
4ff15d4f
Skip missing byte tokens (falcon)
05750239
Update brute force random test
8cda5af9
Better unicode data generation
4af5478f
github-actions added documentation
github-actions added build
github-actions added script
github-actions added android
github-actions added Nvidia GPU
github-actions added nix
github-actions added Vulkan
github-actions added examples
github-actions added devops
github-actions added server
github-actions added ggml
github-actions added SYCL
github-actions added Apple Metal
github-actions added Kompute
Merge branch 'master' into tokenizer-bpe-fixes
e28d0e41
Fix merge: renamed and deleted files
903e47f9
Replace char32_t with uint32_t
b7ee8270
jaime-m-p Merge branch 'master' into tokenizer-bpe-fixes
b8929d5f
jaime-m-p
ggerganov
ggerganov approved these changes on 2024-06-18
ggerganov
jaime-m-p
jaime-m-p merged 37bef894 into master 1 year ago
teleprint-me
teleprint-me commented on 2024-06-18
teleprint-me
teleprint-me commented on 2024-06-18
teleprint-me
teleprint-me commented on 2024-06-21
