llama.cpp
02c1ecad - Tokenizer WPM fixes (#7500)

Commit
1 year ago
Tokenizer WPM fixes (#7500)

* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no matching found.
  • llama.cpp
  • tests
    • test-tokenizer-random.py
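
The last item in the commit message ("Discard all tokens when no matching found") can be illustrated with a minimal Python sketch of greedy longest-match word-piece tokenization. This is not the code from llama.cpp: the function name `wpm_tokenize`, the dictionary-based vocabulary, and the `unk_id` parameter are all hypothetical, and the sketch skips the Unicode normalization that the real preprocessing performs. It only shows the assumed behavior that a word with an unmatched remainder contributes a single unknown token instead of a partial match.

```python
def wpm_tokenize(text, vocab, unk_id):
    """Greedy longest-match tokenization, one whitespace-separated word at a time.

    `vocab` maps piece strings to token ids and `unk_id` is the id emitted for
    a word that cannot be fully matched (both are illustrative assumptions).
    """
    output = []
    for word in text.split():
        start_len = len(output)  # number of tokens emitted before this word
        i = 0
        while i < len(word):
            match = False
            # Try the longest candidate piece first, then shrink it.
            for j in range(len(word), i, -1):
                token_id = vocab.get(word[i:j])
                if token_id is not None:
                    output.append(token_id)
                    i = j
                    match = True
                    break
            if not match:
                # No piece matches at position i: discard every token
                # already emitted for this word.
                del output[start_len:]
                break
        if len(output) == start_len:
            # Nothing survived for this word, so emit the unknown token.
            output.append(unk_id)
    return output


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token": 1, "izer": 2, "fix": 3, "es": 4}
print(wpm_tokenize("tokenizer fixes", toy_vocab, unk_id=0))  # [1, 2, 3, 4]
print(wpm_tokenize("tokenxyz fixes", toy_vocab, unk_id=0))   # [0, 3, 4]
```

In the second call, "token" matches but "xyz" does not, so the partial match is dropped and the whole word maps to the unknown id.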