llama.cpp
Tokenizer BPE fixes
#7530
Merged
jaime-m-p merged 29 commits into ggml-org:master from jaime-m-p:tokenizer-bpe-fixes
Update random test: add_bos_token
e013b231
Add BPE models for testing
55e387b2
bugfix: custom regex split fails with codepoint 0
fe3c5319
Refactor llm_tokenizer_bpe: move code to constructor
6f4c300b
Update random test: add_eos_token
614d0bb8
Add BPE models for testing
61683991
Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
0794b777
Fix falcon punctuation regex
51e933a9
github-actions added testing
github-actions added python
mofosyne added Review Complexity : Medium
Better name functions to append token/bos/eos
1d2f3ad4
Move tokenizer flags to vocab structure.
c83ea1a1
Allow lstrip for 'added_tokens'
615f425a
teleprint-me
commented on 2024-05-25
teleprint-me
commented on 2024-05-25
teleprint-me
commented on 2024-05-25
Default values for special_add_bos/eos
f84b04f1
teleprint-me
commented on 2024-05-25
teleprint-me
commented on 2024-05-25
Fix default value for WPM special_add_eos
7a5578f2
ggerganov
commented on 2024-05-26
ggerganov
commented on 2024-05-26
Better variable names
173ab69d
Build vocab.special_tokens_cache using vocab token types
fef99155
Merge commit '148995e5' into tokenizer-bpe-fixes
d67de1a3
Generalize 'jina-v2' per token attributes
c863752c
Fix merge: 'smaug'
75840fe6
update brute force random test
f58de317
Fix 'jina-v2' per token attributes
974d40b5
Fix unicode whitespaces (deepseek-coder)
07530a8d
Fix unicode whitespaces (deepseek-llm)
4ff15d4f
Skip missing byte tokens (falcon)
05750239
Update brute force random test
8cda5af9
Better unicode data generation
4af5478f
github-actions added documentation
github-actions added build
github-actions added script
github-actions added android
github-actions added Nvidia GPU
github-actions added nix
github-actions added Vulkan
github-actions added examples
github-actions added devops
github-actions added server
github-actions added ggml
github-actions added SYCL
github-actions added Apple Metal
github-actions added Kompute
Merge branch 'master' into tokenizer-bpe-fixes
e28d0e41
Fix merge: renamed and deleted files
903e47f9
Replace char32_t with uint32_t
b7ee8270
Merge branch 'master' into tokenizer-bpe-fixes
b8929d5f
ggerganov
approved these changes on 2024-06-18
jaime-m-p merged 37bef894 into master 1 year ago
teleprint-me
commented on 2024-06-18
teleprint-me
commented on 2024-06-18
teleprint-me
commented on 2024-06-21
Reviewers
ggerganov
teleprint-me
Assignees
No one assigned
Labels
documentation
build
script
testing
android
Nvidia GPU
nix
Vulkan
examples
python
Review Complexity : Medium
devops
server
ggml
SYCL
Apple Metal
Kompute
Milestone
No milestone