onnxruntime
e3c34da4 - Refactor and modernize StringNormalizer. (#28320)

Commit

8 days ago

Refactor and modernize StringNormalizer. (#28320)

This pull request refactors and modernizes the UTF-8 and wide character (wchar_t) string conversion logic in the string normalizer CPU kernel, replacing deprecated and complex code with new, platform-appropriate utilities. The changes improve code maintainability, portability, and performance, especially on non-Windows platforms, by introducing custom UTF-8 conversion routines and simplifying buffer management.

The most important changes are:

**UTF-8 and Wide Character Conversion Utilities:**

* Added new UTF-8 <-> wchar_t conversion functions (`WideToUtf8RequiredSize`, `WideToUtf8`, `Utf8ToWide`, and `Utf8ToWideString`) for non-Windows platforms in `utf8_util.h`, avoiding deprecated `std::codecvt` and providing robust Unicode handling.
* Updated `Utf8ConverterGeneric` in `string_normalizer.cc` to use these new utilities, greatly simplifying the code and removing legacy/deprecated conversion logic.

**Code Simplification and Performance:**

* Simplified buffer size estimation for conversions: now directly uses the UTF-8 string size as an upper bound for the wide buffer, removing the need for a full decode pass just to compute buffer sizes.
* Improved comments and logic for case-insensitive filtering, clarifying why lowercasing is used and how conversions are managed for efficiency. [[1]](diffhunk://#diff-20cdc2200d64f7c8dba541825ed6de8e69c5aaf0c0ece6967d3613482d0aaf16L32-R39) [[2]](diffhunk://#diff-26d2562f008c04f6d64a9c805054957c6a888040bd0912d5c16a53ed05512ca8L614-R446)

**Cleanup and Modernization:**

* Removed all usage of deprecated `std::codecvt` and related workaround code, as well as unnecessary includes and platform-specific handling, resulting in cleaner and more maintainable code. [[1]](diffhunk://#diff-26d2562f008c04f6d64a9c805054957c6a888040bd0912d5c16a53ed05512ca8R8-L27) [[2]](diffhunk://#diff-26d2562f008c04f6d64a9c805054957c6a888040bd0912d5c16a53ed05512ca8L39-R57) [[3]](diffhunk://#diff-26d2562f008c04f6d64a9c805054957c6a888040bd0912d5c16a53ed05512ca8L419-L428)

These changes collectively modernize the string normalization kernel, improve portability, and make the codebase easier to maintain.
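
The buffer-size reasoning in the message can be illustrated with a small standalone sketch. This is not the code from `utf8_util.h`: the names `SketchUtf8Length`, `SketchUtf8RequiredSize`, and `SketchUtf8ToWide` are hypothetical, and `char32_t` stands in for `wchar_t` on the assumption that `wchar_t` is 32 bits wide, as it typically is on non-Windows platforms. Every code point takes at least one UTF-8 byte, so the UTF-8 byte count is a safe upper bound on the number of wide characters and no separate sizing pass is needed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the onnxruntime API): number of UTF-8 bytes
// needed to encode a single code point.
inline std::size_t SketchUtf8Length(char32_t cp) {
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

// Hypothetical helper: total UTF-8 bytes needed for a 32-bit wide string.
inline std::size_t SketchUtf8RequiredSize(const std::u32string& wide) {
  std::size_t total = 0;
  for (char32_t cp : wide) total += SketchUtf8Length(cp);
  return total;
}

// Hypothetical helper: decodes UTF-8 into 32-bit code points.
// Returns false on malformed input (bad lead or continuation byte,
// truncated sequence, overlong form, surrogate, or value above U+10FFFF).
inline bool SketchUtf8ToWide(const std::string& utf8, std::u32string& out) {
  out.clear();
  // Upper bound: each code point consumes at least one input byte,
  // so the output can never hold more elements than utf8.size().
  out.reserve(utf8.size());

  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80) {
      len = 1; cp = b0;
    } else if ((b0 & 0xE0) == 0xC0) {
      len = 2; cp = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) {
      len = 3; cp = b0 & 0x0F;
    } else if ((b0 & 0xF8) == 0xF0) {
      len = 4; cp = b0 & 0x07;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (i + len > utf8.size()) return false;  // truncated sequence

    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
    if (SketchUtf8Length(cp) != len) return false;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;

    out.push_back(cp);
    i += len;
  }
  return true;
}
```

The reverse direction needs an exact size rather than a bound, which is presumably why the commit lists a separate `WideToUtf8RequiredSize`; the sketch's `SketchUtf8RequiredSize` shows one way such a count could be computed.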

References

#28320 - Refactor and modernize StringNormalizer.

Author

yuslepukhin

Parents

19738c57
