unstructured
c1f819c5 - fix: gracefully handle invalid html string during chunking (#4243)

Commit
21 days ago
This PR fixes an issue where an invalid `text_as_html` input to the HTML-based table chunking logic can cause chunking to fail, as the following stack trace shows:

```
|   File "/app/unstructured/unstructured/chunking/base.py", line 594, in iter_chunks
|     yield from _TableChunker.iter_chunks(
|   File "/app/unstructured/unstructured/chunking/base.py", line 837, in _iter_chunks
|     html_size = measure(self._html) if self._html else 0
|                                        ^^^^^^^^^^
|   File "/app/unstructured/unstructured/utils.py", line 154, in __get__
|     value = self._fget(obj)
|             ^^^^^^^^^^^^^^^
|   File "/app/unstructured/unstructured/chunking/base.py", line 866, in _html
|     if not (html_table := self._html_table):
|                           ^^^^^^^^^^^^^^^^
|   File "/app/unstructured/unstructured/utils.py", line 154, in __get__
|     value = self._fget(obj)
|             ^^^^^^^^^^^^^^^
|   File "/app/unstructured/unstructured/chunking/base.py", line 884, in _html_table
|     return HtmlTable.from_html_text(text_as_html)
|            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|   File "/app/unstructured/unstructured/common/html_table.py", line 61, in from_html_text
|     root = fragment_fromstring(html_text)
|            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|   File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 810, in fragment_fromstring
|     elements = fragments_fromstring(
|                ^^^^^^^^^^^^^^^^^^^^^
|   File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 780, in fragments_fromstring
|     raise etree.ParserError(
| lxml.etree.ParserError: There is leading text: '```html\n'
```

The solution is to catch the parser error in `_html_table` in `unstructured/chunking/base.py` and return `None` instead. This way we fall back to text-based chunking for this element and log a warning.
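To illustrate the catch-and-fall-back approach described above, here is a minimal sketch, not the actual patch. The names `parse_table_or_none` and `choose_chunking_mode` are hypothetical. The real change lives in `_html_table` and goes through `HtmlTable.from_html_text`, whereas this sketch calls lxml's `fragment_fromstring` directly and assumes `lxml` is installed.

```python
import logging
from typing import Optional

from lxml import etree
from lxml.html import HtmlElement, fragment_fromstring

logger = logging.getLogger(__name__)


def parse_table_or_none(text_as_html: str) -> Optional[HtmlElement]:
    """Hypothetical helper: parse an HTML fragment, or return None if it is invalid."""
    try:
        return fragment_fromstring(text_as_html)
    except etree.ParserError as exc:
        # Invalid input such as a leading Markdown fence ends up here.
        logger.warning(
            "Could not parse text_as_html, falling back to text-based chunking: %s",
            exc,
        )
        return None


def choose_chunking_mode(text_as_html: str) -> str:
    """Hypothetical caller: pick HTML-based chunking only when parsing succeeded."""
    # Compare against None explicitly: an lxml element without children is falsy.
    return "html" if parse_table_or_none(text_as_html) is not None else "text"
```

With this shape, `choose_chunking_mode("<table><tr><td>a</td></tr></table>")` selects the HTML path, while an input that starts with a Markdown fence, as in the stack trace, selects the text path and emits one warning rather than aborting the whole chunking run.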