unstructured
b1e4b009 - Preserve newlines in Table and TableChunk elements during PDF partitioning (#4214)

Commit
40 days ago
Preserve newlines in Table and TableChunk elements during PDF partitioning (#4214)

Closes #3983

---

## Summary

This PR fixes an issue where newline characters were being incorrectly stripped from `Table` and `TableChunk` elements during PDF partitioning. The `RE_MULTISPACE_INCLUDING_NEWLINES` regex was being applied indiscriminately to all `Text` elements, including tables, which removed newlines that carry structural meaning (such as row separation).

## Changes

- **`unstructured/partition/pdf.py`**: Added conditional logic to skip whitespace normalization for `Table` and `TableChunk` elements, preserving newlines that convey tabular structure
- **`CHANGELOG.md`**: Added entry documenting the fix
- **`unstructured/__version__.py`**: Version bump to 0.18.33

## Problem

When processing PDFs (especially image-based PDFs with tables), the code applied this regex substitution to all `Text` elements:

```python
el.text = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", el.text or "").strip()
```

This stripped meaningful line breaks from table content, degrading the structural representation of tabular data.

## Solution

Added a check to exclude `Table` and `TableChunk` elements from the whitespace normalization:

```python
# Skip newline normalization for Table/TableChunk - newlines carry structural meaning
if not isinstance(el, (Table, TableChunk)):
    el.text = re.sub(
        RE_MULTISPACE_INCLUDING_NEWLINES,
        " ",
        el.text or "",
    ).strip()
```

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
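
## Example

The behavior change can be illustrated with a minimal, self-contained sketch. It does not import `unstructured`: the `Text`, `Table`, and `TableChunk` classes are simplified stand-ins, the `\s+` pattern is an assumed approximation of `RE_MULTISPACE_INCLUDING_NEWLINES`, and `normalize_whitespace` is a hypothetical helper name used only for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: approximates the library's RE_MULTISPACE_INCLUDING_NEWLINES;
# the real pattern in unstructured/partition/pdf.py may differ.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


@dataclass
class Text:
    """Simplified stand-in for unstructured's Text element."""
    text: str


class Table(Text):
    """Simplified stand-in for unstructured's Table element."""


class TableChunk(Text):
    """Simplified stand-in for unstructured's TableChunk element."""


def normalize_whitespace(elements):
    """Hypothetical helper: collapse whitespace runs, except in tables."""
    for el in elements:
        # Skip Table/TableChunk: their newlines separate rows.
        if not isinstance(el, (Table, TableChunk)):
            el.text = re.sub(
                RE_MULTISPACE_INCLUDING_NEWLINES,
                " ",
                el.text or "",
            ).strip()
    return elements


paragraph = Text("Quarterly\n  results ")
table = Table("Region Sales\nNorth 10\nSouth 12")
normalize_whitespace([paragraph, table])
```

After the call, the paragraph text is collapsed onto a single line, while the table text keeps one row per line.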