unstructured
6ada488f - fix: pdfminer drops extractable text (#4310)

Commit
29 days ago
fix: pdfminer drops extractable text (#4310)

<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Medium Risk**
> Changes pdfminer integration to override CID font/CMap handling and introduces custom stream decoding/parsing, which can affect text extraction behavior and performance on diverse PDFs (mitigated by size/mapping caps).
>
> **Overview**
> Fixes PDFs where **body text was silently dropped** because CIDFonts used an *embedded Encoding CMap stream* that `pdfminer.six` doesn’t resolve.
>
> Adds a bounded embedded-CMap decoder/parser and wires it in via `CustomPDFCIDFont` + `CustomPDFResourceManager` so `init_pdfminer()` constructs CID fonts with a parsed CMap (including `WMode`), with DoS-oriented caps on decompression and total mappings.
>
> Updates tests with a new fixture-driven regression for both `FAST` and `HI_RES` strategies plus targeted unit tests for CMap parsing/stream decoding, and bumps version to `0.22.12` with a changelog entry.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4326b15f6c400e81112f894576941d28fb150da7. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
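To make "bounded embedded-CMap decoder/parser" concrete, here is a minimal standalone sketch of the two caps the summary mentions: one on decompressed size and one on total mappings. It is not the code from this commit. The names `bounded_inflate`, `parse_cid_ranges`, `parse_wmode`, and both limit constants are hypothetical, the sketch handles only `begincidrange` blocks, and it uses only the standard library rather than any `pdfminer.six` class.

```python
from __future__ import annotations

import re
import zlib

# Hypothetical caps; the limits used in the actual commit are not shown in the summary.
MAX_DECOMPRESSED_BYTES = 1 << 20
MAX_MAPPINGS = 100_000

_CIDRANGE_BLOCK = re.compile(rb"begincidrange(.*?)endcidrange", re.S)
_CIDRANGE_ENTRY = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(\d+)")
_WMODE = re.compile(rb"/WMode\s+(\d+)\s+def")


def bounded_inflate(data: bytes, limit: int = MAX_DECOMPRESSED_BYTES) -> bytes:
    """Inflate a zlib (FlateDecode) stream, refusing output larger than `limit`."""
    decompressor = zlib.decompressobj()
    # Ask for at most limit + 1 bytes so an oversized stream is detected
    # without ever materializing the whole thing in memory.
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed CMap exceeds size cap")
    return out


def parse_wmode(cmap_text: bytes) -> int:
    """Return the writing mode declared in the CMap (0 = horizontal by default)."""
    match = _WMODE.search(cmap_text)
    return int(match.group(1)) if match else 0


def parse_cid_ranges(cmap_text: bytes, max_mappings: int = MAX_MAPPINGS) -> dict[int, int]:
    """Expand `begincidrange` entries into a code -> CID dict, capped in size."""
    mapping: dict[int, int] = {}
    for block in _CIDRANGE_BLOCK.finditer(cmap_text):
        for lo_hex, hi_hex, cid in _CIDRANGE_ENTRY.findall(block.group(1)):
            lo, hi, start = int(lo_hex, 16), int(hi_hex, 16), int(cid)
            if hi < lo:
                continue  # skip malformed ranges instead of failing the whole font
            # Check the cap before expanding, so a huge range is rejected cheaply.
            if len(mapping) + (hi - lo + 1) > max_mappings:
                raise ValueError("CMap exceeds mapping cap")
            for offset in range(hi - lo + 1):
                mapping[lo + offset] = start + offset
    return mapping


sample = b"/WMode 1 def\n1 begincidrange\n<0020> <0022> 1\nendcidrange\n"
decoded = bounded_inflate(zlib.compress(sample))
cid_map = parse_cid_ranges(decoded)
wmode = parse_wmode(decoded)
```

Checking each cap before doing the expensive work (inflating or expanding a range) is what keeps a hostile PDF from forcing large allocations; a real parser would also need to handle `begincidchar` entries and other stream filters.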