unstructured
03e0ed35 - rfctr(docx): DOCX emits std minified .text_as_html (#3545)

Commit
1 year ago
rfctr(docx): DOCX emits std minified .text_as_html (#3545)

**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_docx()`. Produce minified `.text_as_html` consistent with that formed by chunking.

**Additional Context**
- Nested tables appear as their extracted text in the parent cell (no nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace and no `thead`, `tbody`, or `tfoot` elements).
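
To illustrate the output shape described above, here is a minimal standard-library sketch. It is not the code from this commit: `cell_text` and `minified_table_html` are hypothetical helpers, and joining nested-table text with single spaces is an assumption made for the example.

```python
from html import escape


def cell_text(cell):
    """Return a cell's text; a nested table (a list of rows) is flattened to text."""
    if isinstance(cell, str):
        return cell.strip()
    # Assumption: nested-table cell texts are joined with single spaces.
    texts = (cell_text(c) for row in cell for c in row)
    return " ".join(t for t in texts if t)


def minified_table_html(rows):
    """Build table HTML with no whitespace between tags and no thead/tbody/tfoot."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell_text(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


# The second cell of the second row holds a nested table.
rows = [["Name", "Notes"], ["a", [["x", "y"], ["z", ""]]]]
html = minified_table_html(rows)
print(html)
# <table><tr><td>Name</td><td>Notes</td></tr><tr><td>a</td><td>x y z</td></tr></table>
```

The nested table contributes only its text (`x y z`) to the parent cell, so the result contains exactly one `<table>` element.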