transformers.js
1165f04a - Fix BPE tokenization for weird whitespace characters (Closes #199) (#208)

Comment changes are shownComment changes are hidden
Commit
1 year ago
Fix BPE tokenization for weird whitespace characters (Closes #199) (#208) * Add new tokenizer unit test (#199) * Perform `NFKC` normalization for sentencepiece models w/ precompiled charmap * Fix JSDoc indentation * Add problematic string to unit tests * Use consistent BPE split token * Add second problematic string
Author
Parents
  • src
    • File
      tokenizers.js
  • tests
    • File
      generate_tests.py