llama.cpp
llama3 custom regex split
#6965
Merged
mofosyne merged 88 commits into ggml-org:master from jaime-m-p:gg/bpe-preprocess
merged the changes from deepseeker models to main branch
6fbab2db
Moved regex patterns to unicode.cpp and updated unicode.h
d2cfc222
Moved header files
54f93eb5
Resolved issues
1c924e4b
added and refactored unicode_regex_split and related functions
4056dc5b
Updated/merged the deepseek coder pr
c8e7d952
Refactored code
4c3e882a
Adding unicode regex mappings
a5710a41
Adding unicode regex function
7e308ed2
Added needed functionality, testing remains
feeaf4f3
Fixed issues
75358036
Fixed issue with gpt2 regex custom preprocessor
36d98326
unicode : fix? unicode_wstring_to_utf8
06d3e693
lint : fix whitespaces
c56e19db
tests : add tokenizer tests for numbers
7a44e443
unicode : remove redundant headers
d999cf65
tests : remove and rename tokenizer test scripts
aeafb43e
tests : add sample usage
e1b2bf78
gguf-py : reader prints warnings on duplicate keys
ed42711b
llama : towards llama3 tokenization support (wip)
4907e41a
unicode : shot in the dark to fix tests on Windows
e8c206be
unicode : first try custom implementations
e9891769
Merge branch 'master' into gg/bpe-preprocess
e3f6dc74
convert : add "tokenizer.ggml.pre" GGUF KV (wip)
9b4d63ae
llama : use new pre-tokenizer type
43e12ce8
convert : fix pre-tokenizer type writing
1b9b79dd
lint : fix
8791e94e
make : add test-tokenizer-0-llama-v3
a774d708
wip
c160818e
models : add llama v3 vocab file
96965f67
llama : adapt punctuation regex + add llama 3 regex
ad929833
minor
4434c9d6
unicode : set bomb
a22645c2
unicode : set bomb
2affd0b2
unicode : always use std::wregex
ce5485ae
unicode : support \p{N}, \p{L} and \p{P} natively
91eaa414
unicode : try fix windows
581c4a02
unicode : category support via std::regex
b97add52
Merge branch 'master' into gg/bpe-preprocess
d63cc906
unicode : clean-up
e972e6cb
unicode : simplify
ee6d1b3f
llama3 custom regex split
e11fe2fb
convert : add convert-hf-to-gguf-update.py
76429736
ggerganov force-pushed the gg/bpe-preprocess branch from 18381f14 to 76429736 1 year ago
lint : update
4e3e6d8e
convert : add falcon
1c888eb4
unicode : normalize signatures
1545550e
lint : fix
491f2339
lint : fix
e8dd4a14
convert : remove unused functions
02fd977f
convert : add comments
0f9058ce
convert : exercise contractions
78081502
Using char32_t for codepoints
5cc4b2cf
lint : fix
7b1210f6
already exists unicode_tolower()
6e4d2af6
Typing
2a488739
Restore BOM
0cf9ed34
cmake : refactor test targets
ef4cca9e
tests : refactor vocab tests
43708d22
tests : add more vocabs and tests
c68d2596
unicode : cleanup
af05268c
scripts : ignore new update script in check-requirements.sh
c21ab183
Merge branch 'ggerganov:gg/bpe-preprocess' into gg/bpe-preprocess
866e3941
Fix merge
a0c870db
models : add phi-3, mpt, gpt-2, starcoder
120cf37d
tests : disable obsolete
9a7d430f
tests : use faster bpe test
6d6ce939
llama : more prominent warning for old BPE models
3202676f
tests : disable test-tokenizer-1-bpe due to slowness
80cb3127
Merge remote-tracking branch 'upstream/gg/bpe-preprocess' into gg/bpe…
b66cdd1c
Move unused variable value
5c38f6ed
GPT2 custom regex split
1d8fcc06
ggerganov commented on 2024-04-30
ggerganov commented on 2024-04-30
Add alternative regex for custom split llama3
2cd1eb0d
Style
0c6d820b
Add bruteforce random tests for token encoding
3e3e2838
wip: fixing unicode codepoint ranges
4d441e4a
Merge remote-tracking branch 'upstream/master' into gg/bpe-preprocess
798b576c
Fix merge
69a49ac3
Unicode tables: separator, lowercase, uppercase and whitespace
8fd849eb
llama3 custom regex split: fix \s
67832e55
Restore BOM
edf375d2
Style
a5fa2fec
wip: generate NDF table
def3d13a
Ignore special tokens for testing
7761f8ea
ggerganov changed the base branch from gg/bpe-preprocess to master 1 year ago
Clean gen-unicode-data.py
70ca1fe2
Refactor random tokenizer test
77cbb795
jaime-m-p closed this 1 year ago
jaime-m-p reopened this 1 year ago
Merge branch 'master' into gg/bpe-preprocess
ea471197
ggerganov requested a review from ggerganov 1 year ago
lint : fix
8de8b6d1
tests : add fail test for llama-bpe
12a7b696
ggerganov force pushed from 9d346d06 to 12a7b696 1 year ago
ggerganov approved these changes on 2024-05-09
mofosyne added enhancement
mofosyne added Review Complexity : Medium
mofosyne merged 43248e55 into master 1 year ago
Reviewers: ggerganov
Assignees: No one assigned
Labels: enhancement, Review Complexity : Medium
Milestone: No milestone