llama.cpp
llama : improve BPE pre-processing + LLaMA 3 and Deepseek support
#6920

Merged

ggerganov merged 61 commits into master from gg/bpe-preprocess

merged the changes from deepseeker models to main branch

6fbab2db

Moved regex patterns to unicode.cpp and updated unicode.h

d2cfc222

Moved header files

54f93eb5

Resolved issues

1c924e4b

added and refactored unicode_regex_split and related functions

4056dc5b

Updated/merged the deepseek coder pr

c8e7d952

Refactored code

4c3e882a

Adding unicode regex mappings

a5710a41

Adding unicode regex function

7e308ed2

Added needed functionality, testing remains

feeaf4f3

Fixed issues

75358036

Fixed issue with gpt2 regex custom preprocessor

36d98326

unicode : fix? unicode_wstring_to_utf8

06d3e693

ggerganov commented on 2024-04-26

lint : fix whitespaces

c56e19db

tests : add tokenizer tests for numbers

7a44e443

unicode : remove redundant headers

d999cf65

tests : remove and rename tokenizer test scripts

aeafb43e

tests : add sample usage

e1b2bf78

gguf-py : reader prints warnings on duplicate keys

ed42711b

llama : towards llama3 tokenization support (wip)

4907e41a

unicode : shot in the dark to fix tests on Windows

e8c206be

ggerganov commented on 2024-04-26

unicode : first try custom implementations

e9891769

Merge branch 'master' into gg/bpe-preprocess

e3f6dc74

convert : add "tokenizer.ggml.pre" GGUF KV (wip)

9b4d63ae

llama : use new pre-tokenizer type

43e12ce8

convert : fix pre-tokenizer type writing

1b9b79dd

lint : fix

8791e94e

make : add test-tokenizer-0-llama-v3

a774d708

wip

c160818e

dragnil1 commented on 2024-04-26

models : add llama v3 vocab file

96965f67

llama : adapt punctuation regex + add llama 3 regex

ad929833

minor

4434c9d6

unicode : set bomb

a22645c2

unicode : set bomb

2affd0b2

unicode : always use std::wregex

ce5485ae

unicode : support \p{N}, \p{L} and \p{P} natively

91eaa414

unicode : try fix windows

581c4a02

unicode : category support via std::regex

b97add52

ggerganov force pushed to b97add52 1 year ago

Merge branch 'master' into gg/bpe-preprocess

d63cc906

ggerganov added high priority

ggerganov added need feedback

unicode : clean-up

e972e6cb

unicode : simplify

ee6d1b3f

dragnil1 commented on 2024-04-28

convert : add convert-hf-to-gguf-update.py

76429736

ggerganov force pushed to 76429736 1 year ago

lint : update

4e3e6d8e

convert : add falcon

1c888eb4

unicode : normalize signatures

1545550e

lint : fix

491f2339

compilade commented on 2024-04-28

lint : fix

e8dd4a14

convert : remove unused functions

02fd977f

convert : add comments

0f9058ce

compilade commented on 2024-04-28

convert : exercise contractions

78081502

lint : fix

7b1210f6

dragnil1 commented on 2024-04-28

cmake : refactor test targets

ef4cca9e

tests : refactor vocab tests

43708d22

tests : add more vocabs and tests

c68d2596

ggerganov force pushed to c68d2596 1 year ago

unicode : cleanup

af05268c

scripts : ignore new update script in check-requirements.sh

c21ab183

ggerganov marked this pull request as ready for review 1 year ago

models : add phi-3, mpt, gpt-2, starcoder

120cf37d

tests : disable obsolete

9a7d430f

tests : use faster bpe test

6d6ce939

llama : more prominent warning for old BPE models

3202676f

tests : disable test-tokenizer-1-bpe due to slowness

80cb3127

ggerganov merged f4ab2a41 into master 1 year ago

mofosyne added enhancement
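
Several commits above refer to a LLaMA 3 regex, native handling of \p{N}, \p{L} and \p{P}, and tokenizer tests for numbers. As a rough illustration of what such a BPE pre-tokenization split does, here is a minimal sketch in Python. It is not the code from this PR: the function name `pre_tokenize` is hypothetical, and because the standard `re` module has no \p{...} classes, `[^\W\d_]` stands in for \p{L} and `\d` for \p{N}, so results can differ from the real tokenizer on some Unicode input.

```python
import re

# Approximation of a LLaMA 3 style pre-tokenizer pattern.
# Assumptions (not taken from the PR): [^\W\d_] approximates \p{L},
# \d approximates \p{N}, and [\W_] approximates "neither letter nor number".
_PRE_TOKENIZE = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"        # common English contractions
    r"|(?:(?![\r\n])[\W_])?[^\W\d_]+"      # optional leading symbol/space + letters
    r"|\d{1,3}"                            # digits in groups of at most three
    r"| ?(?:(?!\s)[\W_])+[\r\n]*"          # punctuation run, optional leading space
    r"|\s*[\r\n]+"                         # newlines with preceding whitespace
    r"|\s+(?!\S)"                          # trailing whitespace
    r"|\s+"                                # any remaining whitespace
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks that a BPE merge step would then process one by one."""
    return _PRE_TOKENIZE.findall(text)


if __name__ == "__main__":
    # Long numbers are cut into pieces of up to three digits.
    print(pre_tokenize("Hello world 12345"))
```

The split is lossless in this sketch: joining the returned pieces reproduces the input, so the BPE merges that follow only ever see text that was actually present.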

Reviewers

compilade

teleprint-me

dragnil1

Sumanai

coder543

Assignees

No one assigned

Labels

enhancement, high priority, need feedback

Milestone

No milestone
