unstructured
a589a494 - docx: improve page break fidelity (#1631)

Commit
2 years ago
Page breaks can and often do occur within a paragraph. Currently, the full text of such a paragraph is attributed to the page (number) the paragraph starts on.

Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the `PageBreak` element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements.

This functionality is largely provided upstream by the new `python-docx` v1.0.0 release (the version jumps from 0.8.11 to 1.0.0 because it drops Python 2 support).

That version also makes obsolete the "include hyperlink text in `Paragraph.text`" monkey-patch that we had maintained up to now. Remove that monkey-patch.
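
The splitting behavior described above can be illustrated with a minimal, library-free sketch. It is not the code from this commit: the `Element` class, the `split_paragraph` function, and the use of a form-feed character to stand in for a rendered page break are all assumptions made for illustration. The commit message does not say which page number the `PageBreak` element itself carries, so the sketch leaves it as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a form feed marks where a rendered page break falls in the text.
PAGE_BREAK = "\f"


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    kind: str  # "Text" or "PageBreak"
    text: str
    page_number: Optional[int]


def split_paragraph(text: str, start_page: int) -> List[Element]:
    """Split paragraph text at each page break and number the fragments.

    Text before a break keeps the current page number (n); text after it
    gets the next one (n + 1). A PageBreak element is emitted in between.
    """
    elements: List[Element] = []
    page = start_page
    for index, fragment in enumerate(text.split(PAGE_BREAK)):
        if index > 0:
            # A break precedes every fragment except the first.
            elements.append(Element("PageBreak", "", None))
            page += 1
        fragment = fragment.strip()
        if fragment:
            # Skip empty fragments, e.g. when the break ends the paragraph.
            elements.append(Element("Text", fragment, page))
    return elements


elements = split_paragraph("before the break \fafter the break", start_page=3)
for element in elements:
    print(element.kind, repr(element.text), element.page_number)
```

A paragraph with no page break yields a single text element on its starting page, so the earlier behavior is preserved for the common case.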