llama.cpp
f9d42c59 - convert_hf : identify more added control tokens for SPM tokenziers

Commit

1 year ago

convert_hf : identify more added control tokens for SPM tokenziers This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (the implementation in llama.cpp has the same behavior as SentencePiece) * llama : fix wrong pre-tokenization of byte tokens

References

#8228 - llama : fix pre-tokenization of non-special added tokens

Author

compilade

Parents

6e351e04

llama.cpp f9d42c59 - convert_hf : identify more added control tokens for SPM tokenziers

llama.cpp
f9d42c59 - convert_hf : identify more added control tokens for SPM tokenziers