textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters (PR #9545)
Closes #8981
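On Windows, wide characters are two bytes in size, so a code point above U+FFFF is stored as a surrogate pair that occupies two wide characters. This is also the case for unicode strings in Python 2 on Windows. It is best explained with an example: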
Python 2:
>>> len(u"😉")
2
In Python 3, however, strings are stored with a variable width per code point, chosen from the number of bytes needed to store the highest code point in the string. Whatever the internal width, one index always corresponds to exactly one code point.
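For comparison, here is a minimal Python 3 sketch of the same measurement. It uses only built-in string methods. The sample string and variable names are illustrative and are not taken from the commit.

```python
# Illustrative sample: one ASCII letter, one emoji above U+FFFF, one ASCII letter.
text = "a😉b"

# Python 3 counts code points, so the emoji contributes a single index.
code_point_count = len(text)

# Encoding to UTF-16 shows what a two-byte wide character API sees:
# every code unit is 2 bytes, and the emoji needs a surrogate pair.
utf16_unit_count = len(text.encode("utf-16-le")) // 2

print(code_point_count, utf16_unit_count)
```

The two counts differ by one for every code point above U+FFFF, which is exactly the offset drift described here.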
A much more detailed description of the problem can be found in #8981.
This commit introduces a new textUtils module that aims to mitigate the offset issues introduced by the Python 3 transition. Most offset-based TextInfos rely on a string representation made of two-byte wide characters. For example, Uniscribe uses two-byte wide characters and therefore treats 😉 as two characters, whereas Python 3 treats it as one.
This is where textUtils.WideStringOffsetConverter comes in. This new class keeps both the decoded and the encoded form of a string in one object. The object can be used to convert string offsets between the two representations: the Python 3 one, with one offset per code point, and the Windows wide character one, where a surrogate pair takes up two offsets.
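The idea behind such a converter can be sketched as follows. This is a simplified illustration under stated assumptions, not the textUtils implementation: the class name OffsetConverterSketch, its method names, and the choice to map an offset inside a surrogate pair back to the code point that contains it are all hypothetical.

```python
import bisect


class OffsetConverterSketch:
    """Hypothetical illustration only; not the class added by this commit."""

    def __init__(self, text: str):
        # Keep both forms of the string together in one object.
        self.decoded = text
        # "surrogatepass" lets lone surrogates already present in the str be encoded.
        self.encoded = text.encode("utf-16-le", errors="surrogatepass")
        # _starts[i] is the wide offset at which code point i begins;
        # the final entry is the total length in wide characters.
        self._starts = [0]
        for ch in text:
            units = 2 if ord(ch) > 0xFFFF else 1
            self._starts.append(self._starts[-1] + units)

    def str_to_wide(self, str_offset: int) -> int:
        """Convert a code point offset to a wide character offset."""
        return self._starts[str_offset]

    def wide_to_str(self, wide_offset: int) -> int:
        """Convert a wide character offset to a code point offset.

        Assumption: an offset that falls inside a surrogate pair maps to
        the code point that the pair represents.
        """
        return bisect.bisect_right(self._starts, wide_offset) - 1


converter = OffsetConverterSketch("a😉b")
print(converter.str_to_wide(2))  # offset of "b" counted in wide characters
print(converter.wide_to_str(3))  # offset of "b" counted in code points
```

Precomputing the start offsets keeps each conversion cheap when the same string is queried many times, at the cost of one pass over the text on construction.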