Refactor and modernize StringNormalizer. (#28320)

This pull request refactors and modernizes the UTF-8 and wide character
(wchar_t) string conversion logic in the string normalizer CPU kernel,
replacing deprecated and complex code with new, platform-appropriate
utilities. The changes improve code maintainability, portability, and
performance, especially on non-Windows platforms, by introducing custom
UTF-8 conversion routines and simplifying buffer management.

The most important changes are:

**UTF-8 and Wide Character Conversion Utilities:**

* Added new UTF-8 <-> wchar_t conversion functions
(`WideToUtf8RequiredSize`, `WideToUtf8`, `Utf8ToWide`, and
`Utf8ToWideString`) for non-Windows platforms in `utf8_util.h`, avoiding
the `std::codecvt`-based conversion facilities deprecated since C++17 and
providing robust Unicode handling.
* Updated `Utf8ConverterGeneric` in `string_normalizer.cc` to use these
new utilities, greatly simplifying the code and removing
legacy/deprecated conversion logic.
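
To illustrate the kind of routine that can replace a `std::codecvt`-based converter, here is a minimal sketch of a hand-written UTF-8 to `wchar_t` decoder. It is not the code added in `utf8_util.h`: the name `DecodeUtf8Sketch`, its signature, and its error handling (return `false` on any malformed sequence) are assumptions made for illustration, and it assumes a 32-bit `wchar_t`, as is typical on Linux and macOS.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper for illustration; not the function from utf8_util.h.
// Assumption: wchar_t is 32 bits wide, so one code point maps to one wchar_t.
static_assert(sizeof(wchar_t) == 4, "sketch assumes a 32-bit wchar_t");

// Decodes UTF-8 into `out`. Returns false on malformed input: invalid lead
// byte, bad continuation byte, truncation, overlong form, surrogate code
// point, or a value above U+10FFFF.
inline bool DecodeUtf8Sketch(std::string_view in, std::wstring& out) {
  out.clear();
  out.reserve(in.size());  // the byte count bounds the number of code points
  size_t i = 0;
  while (i < in.size()) {
    const auto b0 = static_cast<unsigned char>(in[i]);
    uint32_t cp = 0;      // code point being assembled
    size_t len = 0;       // sequence length implied by the lead byte
    uint32_t min_cp = 0;  // smallest code point legal for that length
    if (b0 < 0x80) {
      cp = b0; len = 1; min_cp = 0;
    } else if ((b0 & 0xE0) == 0xC0) {
      cp = b0 & 0x1F; len = 2; min_cp = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
      cp = b0 & 0x0F; len = 3; min_cp = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
      cp = b0 & 0x07; len = 4; min_cp = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (in.size() - i < len) return false;  // truncated sequence
    for (size_t k = 1; k < len; ++k) {
      const auto b = static_cast<unsigned char>(in[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong encodings, out-of-range values, and surrogates.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;
    }
    out.push_back(static_cast<wchar_t>(cp));
    i += len;
  }
  return true;
}
```

A caller would pass the raw UTF-8 bytes and check the return value before using `out`. A real implementation might instead substitute U+FFFD or report the error position.
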

**Code Simplification and Performance:**

* Simplified buffer size estimation for conversions: the UTF-8 byte
length is now used directly as an upper bound for the wide buffer (every
wide code unit produced consumes at least one input byte), removing the
need for a full decode pass just to compute buffer sizes.
* Improved comments and logic for case-insensitive filtering, clarifying
why lowercasing is used and how conversions are managed for efficiency.
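
The buffer-size bound can be checked with a short sketch. The helper names below (`ExactWideUnits`, `WideBufferUpperBound`) are hypothetical and are not part of the kernel. They only show that the exact number of wide code units for well-formed UTF-8 never exceeds the input byte count, for a 32-bit `wchar_t` and for a 16-bit one, where a 4-byte sequence becomes a surrogate pair.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: exact number of wide code units needed for
// well-formed UTF-8 input. This is the extra pass the bound makes unnecessary.
// wide_is_utf16 == true models a 16-bit wchar_t, where a 4-byte UTF-8
// sequence (lead byte >= 0xF0) needs two units (a surrogate pair).
inline size_t ExactWideUnits(std::string_view utf8, bool wide_is_utf16) {
  size_t units = 0;
  for (unsigned char b : utf8) {
    if ((b & 0xC0) == 0x80) continue;  // continuation byte: no new unit
    units += (wide_is_utf16 && b >= 0xF0) ? 2 : 1;
  }
  return units;
}

// Hypothetical helper: the cheap bound, which needs no decoding at all.
inline size_t WideBufferUpperBound(std::string_view utf8) {
  return utf8.size();
}
```

The bound is tight for pure ASCII input and increasingly loose as the share of multi-byte sequences grows, so the trade-off is some over-allocation in exchange for skipping a pass over the data.
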

**Cleanup and Modernization:**

* Removed all usage of deprecated `std::codecvt` and related workaround
code, as well as unnecessary includes and platform-specific handling,
resulting in cleaner and more maintainable code.
These changes collectively modernize the string normalization kernel,
improve portability, and make the codebase easier to maintain.