fix(chunking): preserve semantic headers in carried table chunks (#4313)
## Why It Matters
When a large table is split across multiple chunks, continuation chunks
can lose the header context that makes the body rows understandable.
That hurts downstream retrieval and makes reconstructed tables harder to
interpret.
This PR preserves **semantic table headers** on continuation chunks
while keeping the existing compactified table behavior for ordinary body
rows. It also ensures repeated headers can be removed during
reconstruction so merged tables do not duplicate header text.
## What Changed
### Preserve source header row structure for carried headers
`HtmlTable` now keeps the original `<tr>` HTML for each top-level row
before compactification. This allows continuation chunks to reuse source
header markup instead of rebuilding headers from the compactified
representation.
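
As a rough illustration of what "keeping the original `<tr>` HTML for each top-level row" could look like, here is a minimal standalone sketch. It is not the `HtmlTable` implementation: the helper name `top_level_row_html` is hypothetical, and it assumes the table fragment is well-formed enough for the standard-library XML parser.

```python
import xml.etree.ElementTree as ET


def top_level_row_html(table_html: str) -> list[str]:
    """Return the source markup of each top-level <tr> in a <table> fragment.

    Hypothetical helper for illustration only. Rows of nested tables are
    not returned on their own; they stay inside their parent row's markup.
    """
    table = ET.fromstring(table_html)
    rows = []
    for child in table:
        if child.tag == "tr":
            rows.append(child)
        elif child.tag in ("thead", "tbody", "tfoot"):
            rows.extend(child.findall("tr"))

    serialized = []
    for row in rows:
        row.tail = None  # drop whitespace that follows the row in the source
        serialized.append(ET.tostring(row, encoding="unicode"))
    return serialized
```

Capturing the markup before any compaction step is what makes attributes such as `class` or `colspan` available later, when a header row has to be repeated.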
### Repeat semantic headers on continuation chunks
When `repeat_table_headers=True` and a table is split:
* leading header rows are detected from either:
  * rows inside `<thead>`, or
  * contiguous leading rows containing `<th>`
* continuation chunks repeat those rows inside a `<thead>`
* direct child `<td>` cells in repeated header rows are converted to
  `<th>`
* attributes and nested markup in header cells are preserved
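
The detection and promotion rules above can be sketched in a few lines. This is a simplified illustration, not the code in this PR: `leading_header_rows` and `as_thead` are hypothetical names, and the sketch assumes a well-formed table fragment that the standard-library XML parser accepts.

```python
import copy
import xml.etree.ElementTree as ET


def leading_header_rows(table_html: str) -> list[ET.Element]:
    """Find leading header rows: <thead> rows, else leading rows with a <th>."""
    table = ET.fromstring(table_html)
    thead = table.find("thead")
    if thead is not None:
        rows = thead.findall("tr")
        if rows:
            return rows

    # Fallback: contiguous leading rows that have a direct-child <th>.
    tbody = table.find("tbody")
    container = tbody if tbody is not None else table
    headers = []
    for row in container.findall("tr"):
        if row.find("th") is None:
            break  # the first row without a <th> ends the header run
        headers.append(row)
    return headers


def as_thead(rows: list[ET.Element]) -> str:
    """Wrap rows in <thead>, promoting direct-child <td> cells to <th>."""
    thead = ET.Element("thead")
    for row in rows:
        carried = copy.deepcopy(row)  # keep attributes and nested markup
        carried.tail = None
        for cell in carried:  # direct children only
            if cell.tag == "td":
                cell.tag = "th"
        thead.append(carried)
    return ET.tostring(thead, encoding="unicode")
```

Only direct child cells are renamed, so a `<td>` inside a nested table within a header cell is left as it is.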
### Keep retrieval-oriented `.text` behavior intact
Continuation chunk `.text` includes the **textual** content of repeated
header rows so downstream retrieval keeps header context.
At the same time:
* header cells that contain only non-text content are **not** introduced
  into `.text`
* those cells are still preserved in `metadata.text_as_html`
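
A small sketch of the text side of this rule, under the same assumptions as before (well-formed markup, standard-library parser). The function name `header_text` is hypothetical and the whitespace normalization is an illustrative choice, not necessarily what the chunker does.

```python
import xml.etree.ElementTree as ET


def header_text(thead_html: str) -> str:
    """Join the text of header cells, skipping cells that carry no text."""
    thead = ET.fromstring(thead_html)
    parts = []
    for row in thead.findall("tr"):
        for cell in row:
            if cell.tag not in ("th", "td"):
                continue
            # Gather text from the cell and its nested markup, then
            # collapse runs of whitespace.
            text = " ".join("".join(cell.itertext()).split())
            if text:  # a cell holding only an image contributes nothing
                parts.append(text)
    return " ".join(parts)
```

A continuation chunk's `.text` would then be the header text followed by the body text, while the HTML in `metadata.text_as_html` keeps every header cell, including the ones skipped here.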
### Reconstruct merged tables without duplicated headers
`reconstruct_table_from_chunks()` now:
* strips synthetic carried-header text from continuation chunk `.text`
* rebuilds a single canonical `<thead>` when repeated header rows are
present
* only synthesizes that canonical header when the carried rows match the
leading rows of chunk 0
* preserves header attributes and nested markup during reconstruction
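
The stripping and guard logic can be pictured with the simplified model below. It is not the body of `reconstruct_table_from_chunks()`: rows are modeled as tuples of cell text rather than HTML, and both function names and the `header_count` parameter are hypothetical.

```python
Row = tuple[str, ...]


def strip_carried_header(chunk_text: str, carried_text: str) -> str:
    """Remove a carried-header prefix from a continuation chunk's text."""
    if carried_text and chunk_text.startswith(carried_text):
        return chunk_text[len(carried_text):].lstrip()
    return chunk_text


def merge_rows(chunks: list[list[Row]], header_count: int) -> list[Row]:
    """Merge chunk rows, dropping repeated headers that match chunk 0."""
    if not chunks:
        return []
    canonical = chunks[0][:header_count]
    merged = list(chunks[0])
    for rows in chunks[1:]:
        # Guard: only treat leading rows as synthetic repeats when they
        # equal the leading rows of the first chunk.
        if header_count and rows[:header_count] == canonical:
            rows = rows[header_count:]
        merged.extend(rows)
    return merged
```

The guard matters because a continuation chunk may legitimately start with rows that look like headers but belong to the body; dropping them unconditionally would lose data.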
### Preserve existing non-header behavior
This change is intentionally narrow:
* compactified table/body HTML remains unchanged for non-header rows
* ordinary body cells still use the compactified representation
* `repeat_table_headers=False` preserves the legacy non-repeating
behavior
## Behavior Summary
With header repetition enabled:
* the first chunk is unchanged
* continuation chunks repeat semantic header rows in HTML
* continuation chunks prepend only **textual** header content to `.text`
* reconstructed tables remove synthetic repeats and recover a single
header section
With header repetition disabled:
* chunking behavior matches the previous implementation
## Test Coverage
Added and expanded tests for:
* detection of contiguous leading header rows
* preservation of `<thead>` / `<th>` semantics on carried headers
* preservation of header attributes and nested markup
* preservation of header cells that contain only non-text content in
  HTML without polluting `.text`
* exact-fit and near-boundary continuation behavior
* fallback behavior for pathologically large headers
* reconstruction of repeated-header tables without duplication
* canonical `<thead>` reconstruction
* reconstruction guards for mismatched carried-header rows
* source row HTML preservation in `HtmlTable`