llama.cpp
2d219b38 - vocab : ignore invalid UTF-8 input in the BPE tokenizer (#11729)

Commit
245 days ago
vocab : ignore invalid UTF-8 input in the BPE tokenizer (#11729)

Silently insert U+FFFD(s) (Unicode replacement character) instead until the next valid codepoint can be found.

This fixes `llama_tokenize` throwing an exception across the C API boundary or libllama's module boundary (the caller's runtime might be incompatible!).

Returning a proper error code might be desirable; however, the signature of `llama_tokenize` doesn't allow it, as all return values already have an existing meaning.
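
The replacement strategy described above can be illustrated with a minimal sketch. This is not the code from the commit: `decode_utf8_lossy` is a hypothetical helper written for illustration, and emitting one U+FFFD per invalid byte is an assumption (the commit message only says that one or more U+FFFD are inserted until the next valid codepoint is found). The point is that malformed input yields replacement characters rather than an exception.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of llama.cpp): decode UTF-8 into codepoints,
// substituting U+FFFD for every byte that cannot start a valid sequence.
static std::vector<uint32_t> decode_utf8_lossy(const std::string & text) {
    constexpr uint32_t REPLACEMENT = 0xFFFD;
    // Smallest codepoint that may legally use a 1..4 byte encoding;
    // anything below it is an overlong encoding.
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};

    std::vector<uint32_t> out;
    size_t i = 0;
    while (i < text.size()) {
        const uint8_t b0 = static_cast<uint8_t>(text[i]);
        size_t   len = 0; // stays 0 if b0 is not a valid lead byte
        uint32_t cp  = 0;
        if      (b0 < 0x80)           { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }

        // The sequence must fit in the input and every trailing byte
        // must be a continuation byte (10xxxxxx).
        bool ok = len != 0 && i + len <= text.size();
        for (size_t k = 1; ok && k < len; ++k) {
            const uint8_t b = static_cast<uint8_t>(text[i + k]);
            ok = (b & 0xC0) == 0x80;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values above U+10FFFF.
        if (ok && (cp < min_cp[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }

        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            // Never throw: emit a replacement and resynchronize one byte later.
            out.push_back(REPLACEMENT);
            i += 1;
        }
    }
    return out;
}
```

In this sketch, the input bytes `61 FF 62` decode to U+0061, U+FFFD, U+0062, and a truncated sequence such as `E2 82` decodes to two U+FFFD. Because the function always returns normally, no exception can cross a C API or module boundary.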