nvda
f1c884d1 - Python 3: When processing xml coming from virtual buffers, make sure that surrogate characters are handled properly (PR #9897)

Commit

6 years ago

Python 3: When processing xml coming from virtual buffers, make sure that surrogate characters are handled properly (PR #9897) Before this change, in virtual buffers, emoji are shown as two surrogate characters instead of the actual emoji character. This is because virtual buffers output in UTF-16, and surrogate characters are not allowed in xml. Therefore, the XML contains integer values for the surrogate characters, and they are converted to real characters in the xmlFormatting module. As surrogate characters are perfectly valid within Python 3 and even a surrogate pair is allowed, Python 3 does not collapse two surrogate characters into one 32-bit character. If a low surrogate character is handled, add it to the currently buffered text, and quickly encode and decode the text to/from UTF-16. This ensures that surrogate pairs are properly decoded to the associated 32-bit character. The decision to do this for low surrogates only is intentional. A high surrogate is usually followed by a low surrogate. Therefore it doesn't make sense to re-encode if processing a high surrogate. We also want to avoid just re-encoding anything.

References

#9897 - Python 3: When processing xml coming from virtual buffers, make sure that surrogate characters are handled properly

Author

LeonarddeR

Committer

feerrenrut

Parents

92be7aa6

nvda f1c884d1 - Python 3: When processing xml coming from virtual buffers, make sure that surrogate characters are handled properly (PR #9897)

nvda
f1c884d1 - Python 3: When processing xml coming from virtual buffers, make sure that surrogate characters are handled properly (PR #9897)