llama.cpp
llama : improve BPE pre-processing + LLaMA 3 and Deepseek support
#6920
Merged
ggerganov merged 61 commits into master from gg/bpe-preprocess
merged the changes from deepseeker models to main branch
6fbab2db
Moved regex patterns to unicode.cpp and updated unicode.h
d2cfc222
Moved header files
54f93eb5
Resolved issues
1c924e4b
added and refactored unicode_regex_split and related functions
4056dc5b
Updated/merged the deepseek coder pr
c8e7d952
Refactored code
4c3e882a
Adding unicode regex mappings
a5710a41
Adding unicode regex function
7e308ed2
Added needed functionality, testing remains
feeaf4f3
Fixed issues
75358036
Fixed issue with gpt2 regex custom preprocessor
36d98326
unicode : fix? unicode_wstring_to_utf8
06d3e693
ggerganov commented on 2024-04-26
lint : fix whitespaces
c56e19db
tests : add tokenizer tests for numbers
7a44e443
unicode : remove redundant headers
d999cf65
tests : remove and rename tokenizer test scripts
aeafb43e
tests : add sample usage
e1b2bf78
gguf-py : reader prints warnings on duplicate keys
ed42711b
llama : towards llama3 tokenization support (wip)
4907e41a
unicode : shot in the dark to fix tests on Windows
e8c206be
ggerganov commented on 2024-04-26
unicode : first try custom implementations
e9891769
Merge branch 'master' into gg/bpe-preprocess
e3f6dc74
convert : add "tokenizer.ggml.pre" GGUF KV (wip)
9b4d63ae
llama : use new pre-tokenizer type
43e12ce8
convert : fix pre-tokenizer type writing
1b9b79dd
lint : fix
8791e94e
make : add test-tokenizer-0-llama-v3
a774d708
wip
c160818e
dragnil1 commented on 2024-04-26
dragnil1 commented on 2024-04-26
models : add llama v3 vocab file
96965f67
llama : adapt punctuation regex + add llama 3 regex
ad929833
minor
4434c9d6
unicode : set bomb
a22645c2
unicode : set bomb
2affd0b2
unicode : always use std::wregex
ce5485ae
unicode : support \p{N}, \p{L} and \p{P} natively
91eaa414
unicode : try fix windows
581c4a02
unicode : category support via std::regex
b97add52
ggerganov force pushed to b97add52 1 year ago
Merge branch 'master' into gg/bpe-preprocess
d63cc906
ggerganov added label: high priority
ggerganov added label: need feedback
unicode : clean-up
e972e6cb
unicode : simplify
ee6d1b3f
dragnil1 commented on 2024-04-28
convert : add convert-hf-to-gguf-update.py
76429736
ggerganov force pushed to 76429736 1 year ago
lint : update
4e3e6d8e
convert : add falcon
1c888eb4
unicode : normalize signatures
1545550e
lint : fix
491f2339
compilade commented on 2024-04-28
lint : fix
e8dd4a14
convert : remove unused functions
02fd977f
convert : add comments
0f9058ce
compilade commented on 2024-04-28
convert : exercise contractions
78081502
lint : fix
7b1210f6
dragnil1 commented on 2024-04-28
cmake : refactor test targets
ef4cca9e
tests : refactor vocab tests
43708d22
tests : add more vocabs and tests
c68d2596
ggerganov force pushed to c68d2596 1 year ago
unicode : cleanup
af05268c
scripts : ignore new update script in check-requirements.sh
c21ab183
ggerganov marked this pull request as ready for review 1 year ago
models : add phi-3, mpt, gpt-2, starcoder
120cf37d
tests : disable obsolete
9a7d430f
tests : use faster bpe test
6d6ce939
llama : more prominent warning for old BPE models
3202676f
tests : disable test-tokenizer-1-bpe due to slowness
80cb3127
ggerganov merged f4ab2a41 into master 1 year ago
mofosyne added label: enhancement
Reviewers: compilade, teleprint-me, dragnil1, Sumanai, coder543
Assignees: No one assigned
Labels: enhancement, high priority, need feedback
Milestone: No milestone