unicode : add custom Qwen2 regex handler to fix segfault on long input (#21257)
* unicode : add custom Qwen2 regex handler to fix segfault on long input
std::regex matches by recursive backtracking in common implementations
(e.g. libstdc++), so its recursion depth grows with input length. Long
runs of a repeated character (e.g. 43K 'A's) overflow the stack and
segfault during tokenization. The Qwen2 tokenizer regex differs from
Llama3 only in the digit pattern (\p{N} for Qwen2 vs \p{N}{1,3} for
Llama3), so it matched no custom handler and fell through to the
std::regex fallback path.
Add unicode_regex_split_custom_qwen2() following the established pattern
used by gpt2, llama3, kimi_k2, and afmoe custom handlers.
Closes: https://github.com/ggml-org/llama.cpp/issues/21113
* cont : remove TODO comment
* cont : update comment to reflect original regex
* use the correct regex in the comment this time... [no ci]
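
Illustration only, not the code added by this change: a minimal sketch
of why a hand-written scanner avoids the stack overflow, and of the one
difference between the two digit patterns. The helper names
split_digit_runs and is_ascii_digit are hypothetical. The sketch handles
ASCII digits only and lumps every non-digit run into one piece, whereas
the real handler also deals with Unicode categories, contractions and
whitespace. Passing max_digits = 1 mimics \p{N} (Qwen2) and
max_digits = 3 mimics \p{N}{1,3} (Llama3).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration; not part of llama.cpp.
static bool is_ascii_digit(char c) {
    return c >= '0' && c <= '9';
}

// Splits text into pieces: each digit run is cut into chunks of at most
// max_digits characters, and each non-digit run becomes a single piece.
// The scan is a plain loop, so stack usage does not grow with input length.
std::vector<std::string> split_digit_runs(const std::string & text, std::size_t max_digits) {
    if (max_digits == 0) {
        max_digits = 1; // guard: a zero-width chunk would never advance
    }

    std::vector<std::string> pieces;
    std::size_t pos = 0;

    while (pos < text.size()) {
        std::size_t end = pos;
        if (is_ascii_digit(text[pos])) {
            // consume up to max_digits consecutive digits
            while (end < text.size() && end - pos < max_digits && is_ascii_digit(text[end])) {
                ++end;
            }
        } else {
            // consume the whole non-digit run
            while (end < text.size() && !is_ascii_digit(text[end])) {
                ++end;
            }
        }
        pieces.push_back(text.substr(pos, end - pos));
        pos = end;
    }

    return pieces;
}
```

With this sketch, "ab12345" splits into "ab", "1", "2", "3", "4", "5"
for max_digits = 1 and into "ab", "123", "45" for max_digits = 3. A
string of 43000 'A's comes back as a single piece without any recursion.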
---------
Co-authored-by: Aldehir Rojas <hello@alde.dev>