llama : more tokenizer fixes (#2810)

Commit

2 years ago

llama : more tokenizer fixes (#2810)

* tests : write a Python tokenizer test (wip)
* llama : prefix input text for tokenization with whitespace
* llama : distinguish pieces from decoded text + fix detokenization
* common : add comments
* examples : no longer manually add leading space when tokenizing
* tests : use Python to generate tokenizer tests for C++
* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py
* llama.cpp : fix LF token
* hellaswag : move the concat space for clarity
* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>

References

#2810 - llama : more tokenizer fixes

Author

ggerganov

Parents

1591e2e5

llama.cpp edd4c148 - llama : more tokenizer fixes (#2810)
