fix: gracefully handle invalid HTML string during chunking (#4243)
This PR fixes an issue where an invalid `text_as_html` value passed to
the HTML-based table chunking logic can cause chunking to fail, as the
following stack trace shows:
```
| File "/app/unstructured/unstructured/chunking/base.py", line 594, in iter_chunks
| yield from _TableChunker.iter_chunks(
| File "/app/unstructured/unstructured/chunking/base.py", line 837, in _iter_chunks
| html_size = measure(self._html) if self._html else 0
| ^^^^^^^^^^
| File "/app/unstructured/unstructured/utils.py", line 154, in __get__
| value = self._fget(obj)
| ^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/chunking/base.py", line 866, in _html
| if not (html_table := self._html_table):
| ^^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/utils.py", line 154, in __get__
| value = self._fget(obj)
| ^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/chunking/base.py", line 884, in _html_table
| return HtmlTable.from_html_text(text_as_html)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/common/html_table.py", line 61, in from_html_text
| root = fragment_fromstring(html_text)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 810, in fragment_fromstring
| elements = fragments_fromstring(
| ^^^^^^^^^^^^^^^^^^^^^
| File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 780, in fragments_fromstring
| raise etree.ParserError(
| lxml.etree.ParserError: There is leading text: '```html\n'
```
The solution is to catch the parser error in `_html_table` in
`unstructured/chunking/base.py` and return `None` instead. This way we
fall back to text-based chunking for this element and log a warning.
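
As a rough illustration of the approach, the standalone sketch below catches `lxml.etree.ParserError` and returns `None`. It is not the actual diff: the helper name `parse_table_html_or_none`, the logger, and the log message are assumptions made for this example, and it calls `lxml.html.fragment_fromstring` directly rather than going through `HtmlTable.from_html_text`.

```python
import logging
from typing import Optional

from lxml import etree
from lxml.html import HtmlElement, fragment_fromstring

logger = logging.getLogger(__name__)


def parse_table_html_or_none(text_as_html: Optional[str]) -> Optional[HtmlElement]:
    """Hypothetical helper: parse an HTML fragment, or return None if lxml rejects it."""
    if not text_as_html:
        return None
    try:
        return fragment_fromstring(text_as_html)
    except etree.ParserError as e:
        # Returning None signals the caller to use text-based chunking instead.
        logger.warning("Invalid text_as_html, falling back to text-based chunking: %s", e)
        return None


# Input with a leading Markdown fence, as in the stack trace above, yields None.
bad = parse_table_html_or_none("```html\n<table><tr><td>a</td></tr></table>")
# A well-formed fragment still parses to a <table> element.
good = parse_table_html_or_none("<table><tr><td>a</td></tr></table>")
```

Because the caller already treats a falsy `_html_table` as "no HTML available", returning `None` lets the existing text-based path handle the element without further changes.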