unstructured
e1f75a39 - Improve fast partition cold start (#4242)

Commit
22 days ago
Improve fast partition cold start (#4242)

Improve PDF fast strategy cold-start latency by lazy-loading hi-res-only imports in [pdf.py](https://github.com/Unstructured-IO/unstructured/blob/1c3d5e6ef7b6123a2d8739bf9a8c3afecc3dd127/unstructured/partition/pdf.py). This reduces first-call startup overhead without changing partition behavior.

Local benchmarks show a significant fast strategy cold-start speedup of ~35% from 2.75s -> 1.78s. They also show a small hi_res slowdown (~2-4%), which is acceptable given the fast improvements.

Benchmark was run on 6 pdfs:

- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/DA-1p.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/chevron-page.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/embedded-images-tables.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/fake-memo-with-duplicate-page.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/interface-config-guide-p93.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/layout-parser-paper-fast.pdf

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core PDF partitioning by changing import timing and locations; behavior should be unchanged but there is some risk of missed/conditional imports causing runtime errors in less-tested hi_res/OCR/analysis paths.
>
> **Overview**
> Improves PDF `fast` strategy cold-start performance by **lazy-loading hi-res-only dependencies** in `unstructured/partition/pdf.py` (moving several `pdf_image`/`unstructured_inference`-related imports into `_partition_pdf_or_image_local` and other hi-res/OCR-only code paths), while keeping the `fast` path lighter.
>
> Adds `scripts/performance/quick_partition_bench.py` for quick local cold vs warm partition timing across one or more PDFs, updates the table metrics helper to import `convert_pdf_to_images` from `pdf_image_utils`, and bumps the library version to `0.20.4` with corresponding changelog entry.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b66ae0e81ec30ad0910631d78c3dec12f1320a38. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->