PR #6965 llama3 custom regex split

mofosyne merged 88 commits into ggml-org:master from jaime-m-p:gg/bpe-preprocess

merged the changes from deepseeker models to main branch

6fbab2db

Moved regex patterns to unicode.cpp and updated unicode.h

d2cfc222

Moved header files

54f93eb5

Resolved issues

1c924e4b

added and refactored unicode_regex_split and related functions

4056dc5b

Updated/merged the deepseek coder pr

c8e7d952

Refactored code

4c3e882a

Adding unicode regex mappings

a5710a41

Adding unicode regex function

7e308ed2

Added needed functionality, testing remains

feeaf4f3

Fixed issues

75358036

Fixed issue with gpt2 regex custom preprocessor

36d98326

unicode : fix? unicode_wstring_to_utf8

06d3e693

lint : fix whitespaces

c56e19db

tests : add tokenizer tests for numbers

7a44e443

unicode : remove redundant headers

d999cf65

tests : remove and rename tokenizer test scripts

aeafb43e

tests : add sample usage

e1b2bf78

gguf-py : reader prints warnings on duplicate keys

ed42711b

llama : towards llama3 tokenization support (wip)

4907e41a

unicode : shot in the dark to fix tests on Windows

e8c206be

unicode : first try custom implementations

e9891769

Merge branch 'master' into gg/bpe-preprocess

e3f6dc74

convert : add "tokenizer.ggml.pre" GGUF KV (wip)

9b4d63ae

llama : use new pre-tokenizer type

43e12ce8

convert : fix pre-tokenizer type writing

1b9b79dd

lint : fix

8791e94e

make : add test-tokenizer-0-llama-v3

a774d708

wip

c160818e

models : add llama v3 vocab file

96965f67

llama : adapt punctuation regex + add llama 3 regex

ad929833

minor

4434c9d6

unicode : set bomb

a22645c2

unicode : set bomb

2affd0b2

unicode : always use std::wregex

ce5485ae

unicode : support \p{N}, \p{L} and \p{P} natively

91eaa414

unicode : try fix windows

581c4a02

unicode : category support via std::regex

b97add52

Merge branch 'master' into gg/bpe-preprocess

d63cc906

unicode : clean-up

e972e6cb

unicode : simplify

ee6d1b3f

llama3 custom regex split

e11fe2fb

convert : add convert-hf-to-gguf-update.py

76429736

ggerganov force-pushed the gg/bpe-preprocess branch to 76429736 1 year ago

lint : update

4e3e6d8e

convert : add falcon

1c888eb4

unicode : normalize signatures

1545550e

lint : fix

491f2339

lint : fix

e8dd4a14

convert : remove unused functions

02fd977f

convert : add comments

0f9058ce

convert : exercise contractions

78081502

Using char32_t for codepoints

5cc4b2cf

lint : fix

7b1210f6

already exists unicode_tolower()

6e4d2af6

Typing

2a488739

Restore BOM

0cf9ed34

cmake : refactor test targets

ef4cca9e

tests : refactor vocab tests

43708d22

tests : add more vocabs and tests

c68d2596

unicode : cleanup

af05268c

scripts : ignore new update script in check-requirements.sh

c21ab183

Merge branch 'ggerganov:gg/bpe-preprocess' into gg/bpe-preprocess

866e3941

Fix merge

a0c870db

models : add phi-3, mpt, gpt-2, starcoder

120cf37d

tests : disable obsolete

9a7d430f

tests : use faster bpe test

6d6ce939

llama : more prominent warning for old BPE models

3202676f

tests : disable test-tokenizer-1-bpe due to slowness

80cb3127

Merge remote-tracking branch 'upstream/gg/bpe-preprocess' into gg/bpe…

b66cdd1c

Move unused variable value

5c38f6ed

GPT2 custom regex split

1d8fcc06

ggerganov commented on 2024-04-30

Add alternative regex for custom split llama3

2cd1eb0d

Style

0c6d820b

Add bruteforce random tests for token encoding

3e3e2838

wip: fixing unicode codepoint ranges

4d441e4a

Merge remote-tracking branch 'upstream/master' into gg/bpe-preprocess

798b576c

Fix merge

69a49ac3

Unicode tables: separator, lowercase, uppercase and whitespace

8fd849eb

llama3 custom regex split: fix \s

67832e55

Restore BOM

edf375d2

Style

a5fa2fec

wip: generate NDF table

def3d13a

Ignore special tokens for testing

7761f8ea

ggerganov changed the base branch from gg/bpe-preprocess to master 1 year ago

Clean gen-unicode-data.py

70ca1fe2

Refactor random tokenizer test

77cbb795

jaime-m-p closed this 1 year ago

jaime-m-p reopened this 1 year ago

Merge branch 'master' into gg/bpe-preprocess

ea471197

ggerganov requested a review from ggerganov 1 year ago

lint : fix

8de8b6d1

tests : add fail test for llama-bpe

12a7b696

ggerganov force pushed to 12a7b696 1 year ago

ggerganov approved these changes on 2024-05-09

mofosyne added enhancement

mofosyne added Review Complexity : Medium

mofosyne merged 43248e55 into master 1 year ago

Reviewers: ggerganov

Assignees: none

Labels: enhancement, Review Complexity : Medium

Milestone: none
