PR #2810 llama : more tokenizer fixes

ggerganov merged 13 commits into master from fix-tokenizer

tests : write a Python tokenizer test (wip)

5cad62bc

llama : prefix input text for tokenization with whitespace

5d0ffb69

klosax commented on 2023-08-26

llama : distinguish pieces from decoded text + fix detokenization

9668aa11

ggerganov force pushed 2 years ago

common : add comments

1e7a033f

ggerganov force pushed to 1e7a033f 2 years ago

examples : no longer manually add leading space when tokenizing

dfa058ef

tests : use Python to generate tokenizer tests for C++

70005bd5

tests : add option to tokenize text files

e4324cbd

ggerganov force pushed to e4324cbd 2 years ago

ggerganov marked this pull request as ready for review 2 years ago

ggerganov requested a review from SlyEcho 2 years ago

ggerganov requested a review from ikawrakow 2 years ago

ggerganov requested a review from klosax 2 years ago

SlyEcho approved these changes on 2023-08-26

ggerganov commented on 2023-08-26

klosax approved these changes on 2023-08-26

tests : add test-tokenizer-1.py

eb8b3264

Merge branch 'master' into fix-tokenizer

c7677463

llama.cpp : fix LF token

ab3ba64f

hellaswag : move the concat space for clarity

dbcf470b

ggerganov force pushed to dbcf470b 2 years ago

ikawrakow approved these changes on 2023-08-27

tests : add falcon tests (py + cpp, currently do not pass Unicode)

3bb0f849

ggerganov force pushed to 3bb0f849 2 years ago

common : temporary separate llama_detokenize calls for SPM and BPE

841983fe

ggerganov merged edd4c148 into master 2 years ago

ggerganov deleted the fix-tokenizer branch 2 years ago

Reviewers

ikawrakow

klosax

SlyEcho

