nvda
15d8374e - textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters (PR #9545)

Commit

6 years ago

textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters (PR #9545)

Closes #8981

On Windows, wide characters are two bytes in size. This is also the case in Python 2, so a character outside the Basic Multilingual Plane is stored as a surrogate pair and counts as two characters. This is best explained with an example:

Python 2:
>>> len(u"😉")
2

In Python 3, however, strings are stored with a variable width per character, chosen by the number of bytes needed to store the highest code point in the string. One index always corresponds to one code point. A much more detailed description of the problem can be found in #8981.

This commit introduces a new textUtils module that intends to mitigate issues introduced with the Python 3 transition. Most offset-based TextInfos rely on a two-byte wide character string representation. For example, uniscribe uses two-byte wide characters, and therefore 😉 is treated as two characters by uniscribe, whereas Python 3 treats it as one.

This is where textUtils.WideStringOffsetConverter comes in. This new class keeps the decoded and encoded forms of a string in one object. This object can be used to convert string offsets between the two implementations, namely the Python 3 implementation with one offset per code point, and the Windows wide character implementation with surrogate offsets.
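The offset mismatch described above can be illustrated with a minimal Python 3 sketch. This is not the textUtils.WideStringOffsetConverter API: the function names str_to_wide_offset and wide_to_str_offset are hypothetical and exist only for this example, which assumes that wide strings are UTF-16 and that an offset falling inside a surrogate pair should map to the start of that character.

```python
# Illustrative sketch only; these helpers are hypothetical and are not
# part of NVDA's textUtils module.

def _wide_width(char: str) -> int:
    """Number of UTF-16 code units needed for one Python 3 character."""
    return 2 if ord(char) > 0xFFFF else 1


def str_to_wide_offset(text: str, offset: int) -> int:
    """Convert a Python 3 (code point) offset to a UTF-16 code unit offset."""
    return sum(_wide_width(char) for char in text[:offset])


def wide_to_str_offset(text: str, wide_offset: int) -> int:
    """Convert a UTF-16 code unit offset to a Python 3 (code point) offset.

    Assumption: an offset that falls in the middle of a surrogate pair
    is mapped to the start of the character that the pair encodes.
    """
    units = 0
    for index, char in enumerate(text):
        width = _wide_width(char)
        if units + width > wide_offset:
            return index
        units += width
    return len(text)


wink = "\U0001F609"  # the winking face emoji used in the commit message
sample = "a" + wink + "b"

# Python 3 counts code points; a wide string counts UTF-16 code units.
python_length = len(wink)
wide_length = len(wink.encode("utf_16_le")) // 2

# "b" sits at index 2 in Python 3 but at index 3 in the wide string.
b_wide_offset = str_to_wide_offset(sample, 2)
b_str_offset = wide_to_str_offset(sample, 3)
```

Here the conversion walks the string on every call, which keeps the example short. A class that holds both the decoded and encoded forms, as the commit describes, avoids redoing that work for repeated conversions on the same string.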

References

#9545 - textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters

Author

LeonarddeR

Committer

feerrenrut

Parents

bb126125
