unstructured
d4caedf0 - fix: Preserve Line Breaks in Code Blocks During Chunking (#4196)

Commit
42 days ago
## Problem

When using `chunk_elements()` on markdown files containing code blocks, line breaks within the code were being discarded, resulting in unreadable code:

```python
# Before fix - code becomes flattened:
"def hello(): print('Hello') return True"

# Expected - preserve formatting:
"def hello():\n print('Hello')\n return True"
```

Fixes #4095

## Root Cause

Two issues were identified:

1. **HTML Parser**: `<pre>` elements generated generic `Text` elements instead of `CodeSnippet` elements
2. **Chunking**: The `_iter_text_segments()` method normalized all whitespace to single spaces, destroying newlines

## Solution

### 1. HTML Parser Change (`unstructured/partition/html/parser.py`)

Made `<pre>` elements generate `CodeSnippet` elements:

```python
class Pre(BlockItem):
    """Custom element-class for `<pre>` element.

    Can only contain phrasing content. Generates CodeSnippet elements to preserve
    code formatting including whitespace and line breaks.
    """

    _ElementCls = CodeSnippet  # Added this line
```

### 2. Chunking Change (`unstructured/chunking/base.py`)

Modified `_iter_text_segments()` to preserve whitespace for `CodeSnippet` elements:

```python
def _iter_text_segments(self) -> Iterator[str]:
    """Generate overlap text and each element text segment in order.

    Empty text segments are not included. CodeSnippet elements preserve their
    original whitespace (including newlines) to maintain code formatting.
    """
    if self._overlap_prefix:
        yield self._overlap_prefix
    for e in self._elements:
        if e.text and len(e.text):
            # -- preserve whitespace for code snippets to maintain formatting --
            if isinstance(e, CodeSnippet):
                text = e.text.strip()
            else:
                text = " ".join(e.text.strip().split())
            if text:
                yield text
```

## Files Changed

| File | Change |
|------|--------|
| `unstructured/partition/html/parser.py` | Added `CodeSnippet` import, set `_ElementCls = CodeSnippet` in `Pre` class |
| `unstructured/chunking/base.py` | Added `CodeSnippet` import, special handling in `_iter_text_segments()` |
| `test_unstructured/partition/html/test_parser.py` | Added test for `CodeSnippet` generation, updated existing test |
| `test_unstructured/chunking/test_base.py` | Added 2 tests for whitespace preservation |

Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=42954461
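## Illustration

The chunking behavior described under "Solution" can be reproduced in isolation. The sketch below is a minimal, standalone approximation and does not import the library: the `Text` and `CodeSnippet` dataclasses and the free function `iter_text_segments()` are hypothetical stand-ins for the real element classes and the `_iter_text_segments()` method, assumed here only to show the branching on element type.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Text:
    """Stand-in for a generic text element (hypothetical, not the library class)."""

    text: str


@dataclass
class CodeSnippet(Text):
    """Stand-in for a code element whose internal whitespace must survive chunking."""


def iter_text_segments(
    elements: Iterable[Text], overlap_prefix: str = ""
) -> Iterator[str]:
    """Yield the overlap prefix, then one non-empty text segment per element."""
    if overlap_prefix:
        yield overlap_prefix
    for e in elements:
        if not e.text:
            continue
        if isinstance(e, CodeSnippet):
            # Code: trim only the outer whitespace, keep newlines and indentation.
            text = e.text.strip()
        else:
            # Prose: collapse every whitespace run (including newlines) to one space.
            text = " ".join(e.text.split())
        if text:
            yield text


code = "def hello():\n    print('Hello')\n    return True"
elements = [Text("Some   intro\ntext."), CodeSnippet(code), Text("   ")]
segments = list(iter_text_segments(elements))

# The same code wrapped in a plain Text element is flattened, as before the fix.
flattened = list(iter_text_segments([Text(code)]))
```

Note that `strip()` still removes leading whitespace from the first line of a snippet, so a code block that begins with an indented line loses that first line's indentation; only interior whitespace is preserved.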