llama.cpp
llama3 custom regex split
#6965
Merged

jaime-m-p
jaggzh merged the changes from deepseeker models to main branch
6fbab2db
dragnil1 Moved regex patterns to unicode.cpp and updated unicode.h
d2cfc222
dragnil1 Moved header files
54f93eb5
dragnil1 Resolved issues
1c924e4b
dragnil1 added and refactored unicode_regex_split and related functions
4056dc5b
jaggzh Updated/merged the deepseek coder pr
c8e7d952
dragnil1 Refactored code
4c3e882a
dragnil1 Adding unicode regex mappings
a5710a41
dragnil1 Adding unicode regex function
7e308ed2
dragnil1 Added needed functionality, testing remains
feeaf4f3
dragnil1 Fixed issues
75358036
dragnil1 Fixed issue with gpt2 regex custom preprocessor
36d98326
ggerganov unicode : fix? unicode_wstring_to_utf8
06d3e693
ggerganov lint : fix whitespaces
c56e19db
ggerganov tests : add tokenizer tests for numbers
7a44e443
ggerganov unicode : remove redundant headers
d999cf65
ggerganov tests : remove and rename tokenizer test scripts
aeafb43e
ggerganov tests : add sample usage
e1b2bf78
ggerganov gguf-py : reader prints warnings on duplicate keys
ed42711b
ggerganov llama : towards llama3 tokenization support (wip)
4907e41a
ggerganov unicode : shot in the dark to fix tests on Windows
e8c206be
ggerganov unicode : first try custom implementations
e9891769
ggerganov Merge branch 'master' into gg/bpe-preprocess
e3f6dc74
ggerganov convert : add "tokenizer.ggml.pre" GGUF KV (wip)
9b4d63ae
ggerganov llama : use new pre-tokenizer type
43e12ce8
ggerganov convert : fix pre-tokenizer type writing
1b9b79dd
ggerganov lint : fix
8791e94e
ggerganov make : add test-tokenizer-0-llama-v3
a774d708
ggerganov wip
c160818e
ggerganov models : add llama v3 vocab file
96965f67
ggerganov llama : adapt punctuation regex + add llama 3 regex
ad929833
ggerganov minor
4434c9d6
ggerganov unicode : set bomb
a22645c2
ggerganov unicode : set bomb
2affd0b2
ggerganov unicode : always use std::wregex
ce5485ae
ggerganov unicode : support \p{N}, \p{L} and \p{P} natively
91eaa414
ggerganov unicode : try fix windows
581c4a02
ggerganov unicode : category support via std::regex
b97add52
ggerganov Merge branch 'master' into gg/bpe-preprocess
d63cc906
ggerganov unicode : clean-up
e972e6cb
ggerganov unicode : simplify
ee6d1b3f
llama3 custom regex split
e11fe2fb
ggerganov convert : add convert-hf-to-gguf-update.py
76429736
ggerganov force-pushed the gg/bpe-preprocess branch from 18381f14 to 76429736 1 year ago
ggerganov lint : update
4e3e6d8e
ggerganov convert : add falcon
1c888eb4
ggerganov unicode : normalize signatures
1545550e
ggerganov lint : fix
491f2339
ggerganov lint : fix
e8dd4a14
ggerganov convert : remove unused functions
02fd977f
ggerganov convert : add comments
0f9058ce
ggerganov convert : exercise contractions
78081502
Using char32_t for codepoints
5cc4b2cf
ggerganov lint : fix
7b1210f6
already exists unicode_tolower()
6e4d2af6
Typing
2a488739
Restore BOM
0cf9ed34
ggerganov cmake : refactor test targets
ef4cca9e
ggerganov tests : refactor vocab tests
43708d22
ggerganov tests : add more vocabs and tests
c68d2596
ggerganov unicode : cleanup
af05268c
ggerganov scripts : ignore new update script in check-requirements.sh
c21ab183
jaime-m-p Merge branch 'ggerganov:gg/bpe-preprocess' into gg/bpe-preprocess
866e3941
Fix merge
a0c870db
ggerganov models : add phi-3, mpt, gpt-2, starcoder
120cf37d
ggerganov tests : disable obsolete
9a7d430f
ggerganov tests : use faster bpe test
6d6ce939
ggerganov llama : more prominent warning for old BPE models
3202676f
ggerganov tests : disable test-tokenizer-1-bpe due to slowness
80cb3127
Merge remote-tracking branch 'upstream/gg/bpe-preprocess' into gg/bpe…
b66cdd1c
Move unused variable value
5c38f6ed
GPT2 custom regex split
1d8fcc06
ggerganov commented on 2024-04-30
ggerganov commented on 2024-04-30
jaime-m-p Add alternative regex for custom split llama3
2cd1eb0d
Style
0c6d820b
Add bruteforce random tests for token encoding
3e3e2838
wip: fixing unicode codepoint ranges
4d441e4a
Merge remote-tracking branch 'upstream/master' into gg/bpe-preprocess
798b576c
Fix merge
69a49ac3
Unicode tables: separator, lowercase, uppercase and whitespace
8fd849eb
llama3 custom regex split: fix \s
67832e55
Restore BOM
edf375d2
Style
a5fa2fec
wip: generate NDF table
def3d13a
Ignore special tokens for testing
7761f8ea
ggerganov changed the base branch from gg/bpe-preprocess to master 1 year ago
Clean gen-unicode-data.py
70ca1fe2
Refactor random tokenizer test
77cbb795
jaime-m-p closed this 1 year ago
jaime-m-p reopened this 1 year ago
jaime-m-p Merge branch 'master' into gg/bpe-preprocess
ea471197
ggerganov requested a review from ggerganov 1 year ago
ggerganov lint : fix
8de8b6d1
ggerganov tests : add fail test for llama-bpe
12a7b696
ggerganov force-pushed from 9d346d06 to 12a7b696 1 year ago
ggerganov approved these changes on 2024-05-09
mofosyne added the enhancement label
mofosyne added the Review Complexity : Medium label
mofosyne merged 43248e55 into master 1 year ago