fix: keep extracted text aligned with rotated PDF page images in hi_res (#4367)
## Problem
On PDF pages with a non-zero `/Rotate`, the hi_res object-detection
layer (which runs on the rendered page image) and the pdfminer-extracted
text layer could end up in different coordinate frames, off by the page
rotation. The merge then placed extracted text in the wrong locations,
scattering it across the output.
## Fix
`unstructured-inference` may rotate a rendered page image to make its
dominant text upright and reports that angle as
`pdf_rotation_correction` in the page image metadata. This change
mirrors that same rotation onto the pdfminer-extracted coordinates so
both layers share one coordinate frame and merge correctly.
- `_rotate_bboxes` rotates bounding boxes to match PIL's `rotate(angle,
expand=True)`.
- `process_data_with_pdfminer` / `process_file_with_pdfminer` accept a
per-page `rotation_corrections` list and apply it to element coordinates
and link bounding boxes.
- `partition_pdf` reads `pdf_rotation_correction` from the inferred
layout's image metadata and threads it into the pdfminer pass.
`unstructured` performs no orientation detection of its own; it simply
mirrors the correction reported by the renderer, so the two layers stay
aligned by construction.
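
To make the geometry concrete, here is a minimal sketch of mapping a bounding box through the same transform as PIL's `rotate(angle, expand=True)` for right-angle rotations. It is not the `_rotate_bboxes` implementation from this PR: the function name `rotate_bbox_like_pil`, its signature, and the assumption that boxes are `(x1, y1, x2, y2)` in image space with a top-left origin are all illustrative.

```python
from typing import Tuple

# (x1, y1, x2, y2) with a top-left origin and y growing downward (assumed).
BBox = Tuple[float, float, float, float]


def rotate_bbox_like_pil(bbox: BBox, angle: int, width: float, height: float) -> BBox:
    """Hypothetical helper: where a bbox on a (width x height) image lands
    after a counterclockwise rotation by `angle` with the canvas expanded,
    as PIL's Image.rotate(angle, expand=True) does for right angles."""
    x1, y1, x2, y2 = bbox
    angle %= 360
    if angle == 0:
        return (x1, y1, x2, y2)
    if angle == 90:
        # Output canvas is (height x width); the old right edge becomes the top.
        return (y1, width - x2, y2, width - x1)
    if angle == 180:
        # Canvas size is unchanged; both axes flip.
        return (width - x2, height - y2, width - x1, height - y1)
    if angle == 270:
        # Output canvas is (height x width); the old left edge becomes the top.
        return (height - y2, x1, height - y1, x2)
    raise ValueError(f"only right-angle rotations are handled, got {angle}")


# A 90-degree turn followed by a 270-degree turn on the rotated canvas
# (whose width and height are swapped) restores the original box.
box = (10.0, 20.0, 30.0, 60.0)
turned = rotate_bbox_like_pil(box, 90, width=100, height=200)
restored = rotate_bbox_like_pil(turned, 270, width=200, height=100)
```

Each branch re-orders the corners so that `x1 <= x2` and `y1 <= y2` still hold after rotation, which is the kind of validity the unit test described below checks.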
## Tests
- Added a unit test for `_rotate_bboxes` covering the 0/90/180/270
directions, round-trip, and bbox validity.
- Existing pdfminer processing tests pass unchanged.
Requires the paired `unstructured-inference` change that emits
`pdf_rotation_correction`.
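
As a rough illustration of the dependency, the sketch below collects one correction per page from image metadata and falls back to 0 when the key is absent, for example with an older `unstructured-inference`. `PageStub` and `rotation_corrections_from_pages` are hypothetical stand-ins, not the real layout classes or the `_rotation_corrections_from_layout` helper from this PR; only the `pdf_rotation_correction` key comes from the description above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PageStub:
    """Hypothetical stand-in for an inferred page carrying image metadata."""

    image_metadata: Optional[Dict[str, Any]] = None


def rotation_corrections_from_pages(pages: List[PageStub]) -> List[int]:
    """Return one rotation correction (in degrees) per page, defaulting to 0."""
    corrections: List[int] = []
    for page in pages:
        metadata = page.image_metadata or {}
        # A missing or empty value means the renderer applied no rotation.
        corrections.append(int(metadata.get("pdf_rotation_correction") or 0))
    return corrections


pages = [
    PageStub({"pdf_rotation_correction": 90}),
    PageStub({"width": 612, "height": 792}),  # key not emitted
    PageStub(None),  # no metadata at all
]
corrections = rotation_corrections_from_pages(pages)
```

Defaulting to 0 keeps pages without a reported correction on the pre-existing code path, so unrotated documents behave as before.
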
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Keeps extracted text aligned with rotated PDF page images in hi_res by
mirroring the renderer’s rotation onto `pdfminer` coordinates, fixing
scattered text on pages with non-zero `/Rotate`.
- **Bug Fixes**
  - Thread per-page `rotation_corrections` (from `pdf_rotation_correction`
    image metadata) through `partition_pdf` into `process_*_with_pdfminer`
    (file and data paths); rotate element coords and link bboxes to mirror
    PIL `rotate(angle, expand=True)`.
  - Add `_rotation_corrections_from_layout` and `_rotate_bboxes` with
    tests covering default-to-0, pass-through into pdfminer, and
    0/90/180/270 rotation behavior.
- **Dependencies**
  - Bump `unstructured-inference` minimum to `>=1.6.12` which emits
    `pdf_rotation_correction`.
<!-- End of auto-generated description by cubic. -->
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>