fix: Preserve Line Breaks in Code Blocks During Chunking (#4196)
## Problem
Fixes #4095

When using `chunk_elements()` on markdown files containing code blocks,
line breaks within the code were discarded, leaving the chunked code
flattened onto a single line and unreadable:
```python
# Before fix - code becomes flattened:
"def hello(): print('Hello') return True"
# Expected - preserve formatting:
"def hello():\n print('Hello')\n return True"
```
## Root Cause
Two issues were identified:
1. **HTML Parser**: `<pre>` elements generated generic `Text` elements
instead of `CodeSnippet` elements
2. **Chunking**: The `_iter_text_segments()` method normalized all
whitespace to single spaces, destroying newlines
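
To make the second issue concrete, here is a minimal standalone sketch of that kind of whitespace normalization applied to a small snippet. It uses only plain string operations and does not call into `unstructured`; the `code` sample and its four-space indentation are illustrative assumptions.

```python
# Illustrative sample; the four-space indentation is an assumption.
code = "def hello():\n    print('Hello')\n    return True"

# Splitting on any whitespace and re-joining with single spaces collapses
# every run of whitespace, newlines and indentation included.
flattened = " ".join(code.strip().split())

print(flattened)  # def hello(): print('Hello') return True
```

Because `str.split()` with no arguments treats newlines like any other whitespace, the line structure cannot survive this step.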
## Solution
### 1. HTML Parser Change (`unstructured/partition/html/parser.py`)
Made `<pre>` elements generate `CodeSnippet` elements:
```python
class Pre(BlockItem):
    """Custom element-class for `<pre>` element.

    Can only contain phrasing content. Generates CodeSnippet elements to preserve
    code formatting including whitespace and line breaks.
    """

    _ElementCls = CodeSnippet  # Added this line
```
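
As a rough illustration of why a one-line change is enough, the sketch below shows the general pattern of a class-level attribute selecting which element type gets emitted. All classes here are simplified stand-ins written for this example, and the `to_element()` method is hypothetical; the real parser classes have a different and richer interface.

```python
class Text:
    """Stand-in for a generic text element (not the real unstructured class)."""

    def __init__(self, text: str) -> None:
        self.text = text


class CodeSnippet(Text):
    """Stand-in for a code element (not the real unstructured class)."""


class BlockItem:
    """Stand-in parser node; `to_element()` is a hypothetical helper."""

    _ElementCls = Text  # default element type for block items

    def __init__(self, raw: str) -> None:
        self._raw = raw

    def to_element(self) -> Text:
        # The subclass decides the element type by overriding `_ElementCls`.
        return self._ElementCls(self._raw)


class Pre(BlockItem):
    _ElementCls = CodeSnippet  # `<pre>` content is emitted as code


element = Pre("x = 1\ny = 2").to_element()
print(type(element).__name__)
```

Under this pattern, downstream code can tell code apart from prose with a plain `isinstance` check, which is what the chunking change relies on.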
### 2. Chunking Change (`unstructured/chunking/base.py`)
Modified `_iter_text_segments()` to preserve whitespace for
`CodeSnippet` elements:
```python
def _iter_text_segments(self) -> Iterator[str]:
    """Generate overlap text and each element text segment in order.

    Empty text segments are not included. CodeSnippet elements preserve their
    original whitespace (including newlines) to maintain code formatting.
    """
    if self._overlap_prefix:
        yield self._overlap_prefix
    for e in self._elements:
        if e.text and len(e.text):
            # -- preserve whitespace for code snippets to maintain formatting --
            if isinstance(e, CodeSnippet):
                text = e.text.strip()
            else:
                text = " ".join(e.text.strip().split())
            if text:
                yield text
```
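
The behavior of this branch can be checked in isolation with a standalone sketch. The `Text` and `CodeSnippet` dataclasses below are simplified stand-ins, and `iter_text_segments()` is a free-function copy of the logic written for illustration; none of it imports from `unstructured`.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Text:
    """Stand-in for a generic text element (not the real unstructured class)."""

    text: str


@dataclass
class CodeSnippet(Text):
    """Stand-in for a code element (not the real unstructured class)."""


def iter_text_segments(elements: List[Text], overlap_prefix: str = "") -> Iterator[str]:
    """Free-function copy of the segment logic, for illustration only."""
    if overlap_prefix:
        yield overlap_prefix
    for e in elements:
        if not e.text:
            continue
        if isinstance(e, CodeSnippet):
            # Only the outer whitespace is trimmed; inner newlines survive.
            text = e.text.strip()
        else:
            # Prose is normalized to single spaces, as before the fix.
            text = " ".join(e.text.strip().split())
        if text:
            yield text


elements = [
    Text("Some   prose\nwith a line break."),
    CodeSnippet("\ndef hello():\n    print('Hello')\n    return True\n"),
]
segments = list(iter_text_segments(elements))
print(segments)
```

Note that `strip()` still removes whitespace at the very start and end of a snippet, so any indentation on the first line of a code block is lost; only interior newlines and indentation are preserved.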
## Files Changed
| File | Change |
|------|--------|
| `unstructured/partition/html/parser.py` | Added `CodeSnippet` import, set `_ElementCls = CodeSnippet` in `Pre` class |
| `unstructured/chunking/base.py` | Added `CodeSnippet` import, special handling in `_iter_text_segments()` |
| `test_unstructured/partition/html/test_parser.py` | Added test for `CodeSnippet` generation, updated existing test |
| `test_unstructured/chunking/test_base.py` | Added 2 tests for whitespace preservation |

Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=42954461