Preserve newlines in Table and TableChunk elements during PDF partitioning (#4214)
Closes #3983
---
## Summary
This PR fixes an issue where newline characters in `Table` and
`TableChunk` elements were collapsed into spaces during PDF partitioning.
The `RE_MULTISPACE_INCLUDING_NEWLINES` regex was applied to every `Text`
element, tables included, which removed newlines that carry structural
meaning (such as row separation).
## Changes
- **`unstructured/partition/pdf.py`**: Added conditional logic to skip
whitespace normalization for `Table` and `TableChunk` elements,
preserving newlines that convey tabular structure
- **`CHANGELOG.md`**: Added entry documenting the fix
- **`unstructured/__version__.py`**: Version bump to 0.18.33
## Problem
When processing PDFs (especially image-based PDFs with tables), the code
applied this regex substitution to all `Text` elements:
```python
el.text = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", el.text or "").strip()
```
This replaced line breaks in table content with single spaces, so row
boundaries were lost and the structural representation of tabular data
degraded.
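
To make the effect concrete, here is a minimal sketch of what the substitution does to multi-row table text. `WHITESPACE_RUN` is a hypothetical stand-in that assumes the real `RE_MULTISPACE_INCLUDING_NEWLINES` matches any run of whitespace; the actual pattern in the library may differ, and the sample text is invented for illustration.

```python
import re

# Assumption: stand-in for RE_MULTISPACE_INCLUDING_NEWLINES, taken here to
# match any run of whitespace, newlines included.
WHITESPACE_RUN = re.compile(r"\s+")

# Invented table text with one row per line.
table_text = "Name Qty\nApples 3\nPears 5"

# Same call shape as the original substitution: every whitespace run,
# including each newline, becomes a single space.
flattened = WHITESPACE_RUN.sub(" ", table_text).strip()

print(repr(flattened))
```

After the substitution the three rows sit on one line, and nothing in the text marks where one row ends and the next begins.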
## Solution
Added a check to exclude `Table` and `TableChunk` elements from the
whitespace normalization:
```python
# Skip newline normalization for Table/TableChunk - newlines carry structural meaning
if not isinstance(el, (Table, TableChunk)):
    el.text = re.sub(
        RE_MULTISPACE_INCLUDING_NEWLINES,
        " ",
        el.text or "",
    ).strip()
```
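
The guard can be exercised in isolation with a small self-contained sketch. The `Text`, `Table`, and `TableChunk` classes below are simplified stand-ins rather than the library's element classes, and `WHITESPACE_RUN` and `normalize_whitespace` are hypothetical names introduced only for this example.

```python
import re
from dataclasses import dataclass

# Assumption: stand-in for RE_MULTISPACE_INCLUDING_NEWLINES.
WHITESPACE_RUN = re.compile(r"\s+")


@dataclass
class Text:
    """Simplified stand-in for a text element."""
    text: str


class Table(Text):
    """Simplified stand-in for a table element."""


class TableChunk(Table):
    """Simplified stand-in for a chunked table element."""


def normalize_whitespace(elements):
    """Collapse whitespace runs in place, skipping table elements."""
    for el in elements:
        # Newlines in table text carry row structure, so leave them alone.
        if not isinstance(el, (Table, TableChunk)):
            el.text = WHITESPACE_RUN.sub(" ", el.text or "").strip()
    return elements


paragraph = Text("First line\n  second   line ")
table = Table("Name Qty\nApples 3")
normalize_whitespace([paragraph, table])

print(repr(paragraph.text))
print(repr(table.text))
```

In this sketch the paragraph is normalized to a single line while the table text keeps its newline, which is the behavior the fix aims for.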
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>