unstructured
dedf1441 - fix: keep extracted text aligned with rotated PDF page images in hi_res (#4367)

Commit
13 days ago
fix: keep extracted text aligned with rotated PDF page images in hi_res (#4367)

## Problem

On PDF pages with a non-zero `/Rotate`, the hi_res object-detection layer (which runs on the rendered page image) and the pdfminer-extracted text layer could end up in different coordinate frames, off by the page rotation. The merge then placed extracted text in the wrong locations, scattering it across the output.

## Fix

`unstructured-inference` may rotate a rendered page image to make its dominant text upright and reports that angle as `pdf_rotation_correction` in the page image metadata. This change mirrors that same rotation onto the pdfminer-extracted coordinates so both layers share one coordinate frame and merge correctly.

- `_rotate_bboxes` rotates bounding boxes to match PIL's `rotate(angle, expand=True)`.
- `process_data_with_pdfminer` / `process_file_with_pdfminer` accept a per-page `rotation_corrections` list and apply it to element coordinates and link bounding boxes.
- `partition_pdf` reads `pdf_rotation_correction` from the inferred layout's image metadata and threads it into the pdfminer pass.

`unstructured` performs no orientation detection of its own; it simply mirrors the correction reported by the renderer, so the two layers stay aligned by construction.

## Tests

- Added a unit test for `_rotate_bboxes` covering the 0/90/180/270 directions, round-trip, and bbox validity.
- Existing pdfminer processing tests pass unchanged.

Requires the paired `unstructured-inference` change that emits `pdf_rotation_correction`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated description by cubic. -->

---

## Summary by cubic

Keeps extracted text aligned with rotated PDF page images in hi_res by mirroring the renderer’s rotation onto `pdfminer` coordinates, fixing scattered text on pages with non-zero `/Rotate`.

- **Bug Fixes**
  - Thread per-page `rotation_corrections` (from `pdf_rotation_correction` image metadata) through `partition_pdf` into `process_*_with_pdfminer` (file and data paths); rotate element coords and link bboxes to mirror PIL `rotate(angle, expand=True)`.
  - Add `_rotation_corrections_from_layout` and `_rotate_bboxes` with tests covering default-to-0, pass-through into pdfminer, and 0/90/180/270 rotation behavior.
- **Dependencies**
  - Bump `unstructured-inference` minimum to `>=1.6.12` which emits `pdf_rotation_correction`.

<sup>Written for commit 6e7f1221a99c587451b0c80cf30b4b544e74757a. Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
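
For readers who want to see what "mirroring PIL's `rotate(angle, expand=True)` onto bounding boxes" can look like, here is a minimal sketch. It is not the `_rotate_bboxes` implementation from this commit. The function name `rotate_bbox_like_pil`, its signature, and the coordinate convention (top-left origin, y pointing down, same pixel frame as the unrotated page image) are assumptions made for illustration. It also assumes the angle is a multiple of 90 and follows PIL's counter-clockwise convention; how the sign of `pdf_rotation_correction` maps onto that is not stated in the message above.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), top-left origin


def rotate_bbox_like_pil(bbox: BBox, angle: int, width: float, height: float) -> BBox:
    """Hypothetical helper: map a bbox from an unrotated (width x height) image
    into the frame produced by PIL's Image.rotate(angle, expand=True).

    PIL rotates counter-clockwise; with expand=True a 90 or 270 degree
    rotation swaps the output width and height.
    """
    x1, y1, x2, y2 = bbox
    angle %= 360
    if angle == 0:
        corners = [(x1, y1), (x2, y2)]
    elif angle == 90:
        # Counter-clockwise quarter turn: (x, y) -> (y, width - x)
        corners = [(y1, width - x1), (y2, width - x2)]
    elif angle == 180:
        # Half turn: (x, y) -> (width - x, height - y)
        corners = [(width - x1, height - y1), (width - x2, height - y2)]
    elif angle == 270:
        # Clockwise quarter turn: (x, y) -> (height - y, x)
        corners = [(height - y1, x1), (height - y2, x2)]
    else:
        raise ValueError("this sketch only handles multiples of 90 degrees")

    # Re-normalise so the result is a valid box with x1 <= x2 and y1 <= y2.
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

Rotating by 90 and then by 270 (with width and height swapped for the second call, since the intermediate image is `height` wide and `width` tall) returns the original box, which is the kind of round-trip property the unit test described above checks.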