unstructured
615782a4 - fix(chunking): preserve semantic headers in carried table chunks (#4313)

Commit
1 day ago
fix(chunking): preserve semantic headers in carried table chunks (#4313)

## Why It Matters

When a large table is split across multiple chunks, continuation chunks can lose the header context that makes the body rows understandable. That hurts downstream retrieval and makes reconstructed tables harder to interpret.

This PR preserves **semantic table headers** on continuation chunks while keeping the existing compactified table behavior for ordinary body rows. It also ensures repeated headers can be removed during reconstruction so merged tables do not duplicate header text.

## What Changed

### Preserve source header row structure for carried headers

`HtmlTable` now keeps the original `<tr>` HTML for each top-level row before compactification. This allows continuation chunks to reuse source header markup instead of rebuilding headers from the compactified representation.

### Repeat semantic headers on continuation chunks

When `repeat_table_headers=True` and a table is split:

* leading header rows are detected from either:
  * rows inside `<thead>`, or
  * contiguous leading rows containing `<th>`
* continuation chunks repeat those rows inside a `<thead>`
* direct child `<td>` cells in repeated header rows are converted to `<th>`
* attributes and nested markup in header cells are preserved

### Keep retrieval-oriented `.text` behavior intact

Continuation chunk `.text` includes the **textual** content of repeated header rows so downstream retrieval keeps header context.

At the same time:

* non-text-only header cells are **not** introduced into `.text`
* those cells are still preserved in `metadata.text_as_html`

### Reconstruct merged tables without duplicated headers

`reconstruct_table_from_chunks()` now:

* strips synthetic carried-header text from continuation chunk `.text`
* rebuilds a single canonical `<thead>` when repeated header rows are present
* only synthesizes that canonical header when the carried rows match the leading rows of chunk 0
* preserves header attributes and nested markup during reconstruction

### Preserve existing non-header behavior

This change is intentionally narrow:

* compactified table/body HTML remains unchanged for non-header rows
* ordinary body cells still use the compactified representation
* `repeat_table_headers=False` preserves the legacy non-repeating behavior

## Behavior Summary

With header repetition enabled:

* the first chunk is unchanged
* continuation chunks repeat semantic header rows in HTML
* continuation chunks prepend only **textual** header content to `.text`
* reconstructed tables remove synthetic repeats and recover a single header section

With header repetition disabled:

* chunking behavior matches the previous implementation

## Test Coverage

Added and expanded tests for:

* detection of contiguous leading header rows
* preservation of `<thead>` / `<th>` semantics on carried headers
* preservation of header attributes and nested markup
* preservation of non-text-only header cells in HTML without polluting `.text`
* exact-fit and near-boundary continuation behavior
* fallback behavior for pathologically large headers
* reconstruction of repeated-header tables without duplication
* canonical `<thead>` reconstruction
* reconstruction guards for mismatched carried-header rows
* source row HTML preservation in `HtmlTable`
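The header detection and carry rules described under "Repeat semantic headers on continuation chunks" can be illustrated with a small standalone sketch. It is not the code from this PR: it uses only the standard library's `xml.etree.ElementTree` (so it assumes well-formed markup), and the helper names `top_level_rows`, `leading_header_rows`, and `carried_thead` are hypothetical. It also folds the two detection sources into a single pass, which may differ from how the library actually combines them.

```python
import copy
import xml.etree.ElementTree as ET


def top_level_rows(table):
    """Yield (row, in_thead) for each top-level <tr>, in document order."""
    for child in table:
        if child.tag == "tr":
            yield child, False
        elif child.tag in ("thead", "tbody", "tfoot"):
            for row in child:
                if row.tag == "tr":
                    yield row, child.tag == "thead"


def leading_header_rows(table):
    """Collect contiguous leading rows that sit in <thead> or have a direct <th> cell."""
    headers = []
    for row, in_thead in top_level_rows(table):
        if in_thead or any(cell.tag == "th" for cell in row):
            headers.append(row)
        else:
            break  # first body row ends the header run
    return headers


def carried_thead(header_rows):
    """Build a <thead> for a continuation chunk from copies of the header rows."""
    thead = ET.Element("thead")
    for row in header_rows:
        row_copy = copy.deepcopy(row)  # keeps attributes and nested markup
        row_copy.tail = None
        for cell in row_copy:  # direct children only
            if cell.tag == "td":
                cell.tag = "th"
        thead.append(row_copy)
    return thead


# Hypothetical input: one header row, one body row, then a late row with <th>.
table = ET.fromstring(
    "<table>"
    "<tr><th class='k'>Name</th><td><b>Unit</b></td></tr>"
    "<tr><td>a</td><td>1</td></tr>"
    "<tr><th>late</th><td>2</td></tr>"
    "</table>"
)
print(ET.tostring(carried_thead(leading_header_rows(table)), encoding="unicode"))
```

In this sketch the third row is not treated as a header even though it contains a `<th>`, because the run of leading header rows ends at the first ordinary body row.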
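The reconstruction guard ("only synthesizes that canonical header when the carried rows match the leading rows of chunk 0") can be sketched in the same spirit. This is a simplified model, not the body of `reconstruct_table_from_chunks()`: rows are plain strings, and `merge_rows` and `n_header` are hypothetical names introduced for illustration.

```python
from typing import List, Sequence


def merge_rows(chunks: Sequence[Sequence[str]], n_header: int) -> List[str]:
    """Merge per-chunk row lists, dropping carried headers that match chunk 0."""
    if not chunks:
        return []
    merged = list(chunks[0])
    header = merged[:n_header]  # leading rows of chunk 0 act as the reference
    for rows in chunks[1:]:
        rows = list(rows)
        # Strip the repeat only when it matches the reference header exactly;
        # mismatched leading rows are kept as ordinary content.
        if header and rows[:n_header] == header:
            rows = rows[n_header:]
        merged.extend(rows)
    return merged


chunks = [["H", "a", "b"], ["H", "c"], ["X", "d"]]
print(merge_rows(chunks, n_header=1))
```

Here the second chunk's repeated `"H"` is dropped, while the third chunk's leading `"X"` is kept because it does not match the header of chunk 0.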