llama : more tokenizer fixes #2810
tests : write a Python tokenizer test (wip)
5cad62bc
llama : prefix input text for tokenization with whitespace
5d0ffb69
klosax commented on 2023-08-26
llama : distinguish pieces from decoded text + fix detokenization
9668aa11
common : add comments
1e7a033f
ggerganov force pushed to 1e7a033f 2 years ago
examples : no longer manually add leading space when tokenizing
dfa058ef
tests : use Python to generate tokenizer tests for C++
70005bd5
tests : add option to tokenize text files
e4324cbd
ggerganov force pushed to e4324cbd 2 years ago
ggerganov
marked this pull request as ready for review 2 years ago
SlyEcho approved these changes on 2023-08-26
klosax approved these changes on 2023-08-26
tests : add test-tokenizer-1.py
eb8b3264
Merge branch 'master' into fix-tokenizer
c7677463
llama.cpp : fix LF token
ab3ba64f
hellaswag : move the concat space for clarity
dbcf470b
ggerganov force pushed to dbcf470b 2 years ago
ikawrakow approved these changes on 2023-08-27
tests : add falcon tests (py + cpp, currently do not pass Unicode)
3bb0f849
ggerganov force pushed to 3bb0f849 2 years ago
common : temporary separate llama_detokenize calls for SPM and BPE
841983fe
ggerganov merged edd4c148 into master 2 years ago
ggerganov
deleted the fix-tokenizer branch 2 years ago