llama.cpp
llama : improve BPE pre-processing + LLaMA 3 and Deepseek support
#6920
Merged

ggerganov merged 61 commits into master from gg/bpe-preprocess
jaggzh merged the changes from deepseeker models to main branch
6fbab2db
dragnil1 Moved regex patterns to unicode.cpp and updated unicode.h
d2cfc222
dragnil1 Moved header files
54f93eb5
dragnil1 Resolved issues
1c924e4b
dragnil1 added and refactored unicode_regex_split and related functions
4056dc5b
jaggzh Updated/merged the deepseek coder pr
c8e7d952
dragnil1 Refactored code
4c3e882a
dragnil1 Adding unicode regex mappings
a5710a41
dragnil1 Adding unicode regex function
7e308ed2
dragnil1 Added needed functionality, testing remains
feeaf4f3
dragnil1 Fixed issues
75358036
dragnil1 Fixed issue with gpt2 regex custom preprocessor
36d98326
ggerganov unicode : fix? unicode_wstring_to_utf8
06d3e693
ggerganov commented on 2024-04-26
ggerganov lint : fix whitespaces
c56e19db
ggerganov tests : add tokenizer tests for numbers
7a44e443
ggerganov unicode : remove redundant headers
d999cf65
ggerganov tests : remove and rename tokenizer test scripts
aeafb43e
ggerganov tests : add sample usage
e1b2bf78
ggerganov gguf-py : reader prints warnings on duplicate keys
ed42711b
ggerganov llama : towards llama3 tokenization support (wip)
4907e41a
ggerganov unicode : shot in the dark to fix tests on Windows
e8c206be
ggerganov commented on 2024-04-26
ggerganov unicode : first try custom implementations
e9891769
ggerganov Merge branch 'master' into gg/bpe-preprocess
e3f6dc74
ggerganov convert : add "tokenizer.ggml.pre" GGUF KV (wip)
9b4d63ae
ggerganov llama : use new pre-tokenizer type
43e12ce8
ggerganov convert : fix pre-tokenizer type writing
1b9b79dd
ggerganov lint : fix
8791e94e
ggerganov make : add test-tokenizer-0-llama-v3
a774d708
ggerganov wip
c160818e
dragnil1 commented on 2024-04-26
dragnil1 commented on 2024-04-26
ggerganov models : add llama v3 vocab file
96965f67
ggerganov llama : adapt punctuation regex + add llama 3 regex
ad929833
ggerganov minor
4434c9d6
ggerganov unicode : set bomb
a22645c2
ggerganov unicode : set bomb
2affd0b2
ggerganov unicode : always use std::wregex
ce5485ae
ggerganov unicode : support \p{N}, \p{L} and \p{P} natively
91eaa414
ggerganov unicode : try fix windows
581c4a02
ggerganov unicode : category support via std::regex
b97add52
ggerganov force pushed to b97add52 1 year ago
ggerganov Merge branch 'master' into gg/bpe-preprocess
d63cc906
ggerganov added the "high priority" label
ggerganov added the "need feedback" label
ggerganov unicode : clean-up
e972e6cb
ggerganov unicode : simplify
ee6d1b3f
dragnil1 commented on 2024-04-28
ggerganov convert : add convert-hf-to-gguf-update.py
76429736
ggerganov force pushed to 76429736 1 year ago
ggerganov lint : update
4e3e6d8e
ggerganov convert : add falcon
1c888eb4
ggerganov unicode : normalize signatures
1545550e
ggerganov lint : fix
491f2339
compilade commented on 2024-04-28
ggerganov lint : fix
e8dd4a14
ggerganov convert : remove unused functions
02fd977f
ggerganov convert : add comments
0f9058ce
compilade commented on 2024-04-28
ggerganov convert : exercise contractions
78081502
ggerganov lint : fix
7b1210f6
dragnil1 commented on 2024-04-28
ggerganov cmake : refactor test targets
ef4cca9e
ggerganov tests : refactor vocab tests
43708d22
ggerganov tests : add more vocabs and tests
c68d2596
ggerganov force pushed to c68d2596 1 year ago
ggerganov unicode : cleanup
af05268c
ggerganov scripts : ignore new update script in check-requirements.sh
c21ab183
ggerganov marked this pull request as ready for review 1 year ago
ggerganov models : add phi-3, mpt, gpt-2, starcoder
120cf37d
ggerganov tests : disable obsolete
9a7d430f
ggerganov tests : use faster bpe test
6d6ce939
ggerganov llama : more prominent warning for old BPE models
3202676f
ggerganov tests : disable test-tokenizer-1-bpe due to slowness
80cb3127
ggerganov merged f4ab2a41 into master 1 year ago
mofosyne added the "enhancement" label
