llama.cpp
02c1ecad - Tokenizer WPM fixes (#7500)

Commit

1 year ago

Tokenizer WPM fixes (#7500)

* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no matching found.
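For readers unfamiliar with WPM (WordPiece) tokenization, the two behaviors named in the message (splitting by whitespace during preprocessing, and discarding partial tokens when a word cannot be fully matched) can be illustrated with a minimal Python sketch. This is not the code from this commit: the function names, the toy vocabulary, the BERT-style `##` continuation prefix, and the `[UNK]` fallback are all assumptions made for illustration, and llama.cpp's internal vocabulary representation may differ.

```python
import unicodedata


def preprocess(text):
    """Lowercase, strip accents, and split on whitespace and punctuation in one pass.

    Simplified assumption: only Unicode categories P* count as punctuation.
    """
    words, current = [], []
    for ch in unicodedata.normalize("NFD", text):
        category = unicodedata.category(ch)
        if category == "Mn":  # combining mark left over from NFD: drop it
            continue
        if ch.isspace():
            if current:
                words.append("".join(current))
                current = []
        elif category.startswith("P"):
            if current:
                words.append("".join(current))
                current = []
            words.append(ch)  # each punctuation character is its own word
        else:
            current.append(ch.lower())
    if current:
        words.append("".join(current))
    return words


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # hypothetical continuation marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: discard the partial
            # tokens collected for this word and emit the unknown token.
            return [unk]
        tokens.append(match)
        start = end
    return tokens


def tokenize(text, vocab):
    """Preprocess the text, then apply WordPiece to each resulting word."""
    out = []
    for word in preprocess(text):
        out.extend(wordpiece(word, vocab))
    return out
```

With the toy vocabulary `{"hello", "world", ",", "!", "un", "##aff", "##able"}`, `tokenize("Héllo, world!", vocab)` yields `["hello", ",", "world", "!"]`, while `wordpiece("unaffablez", vocab)` yields `["[UNK]"]` rather than the partial match `["un", "##aff", "##able"]`.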

References

#7500 - Tokenizer WPM fixes for bert-bge and jina-v2-en

Author

jaime-m-p

Parents

6bd12ce4

Files (2)

llama.cpp
tests/test-tokenizer-random.py
