unstructured
aef3bc4d - fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode (#4266)

Commit
14 days ago
fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode (#4266)

## Problem

`_patch_current_chars_with_render_mode` is called on every `do_TJ`/`do_Tj` text operator during PDF parsing. The original implementation re-scans the entire `cur_item._objs` list each time, checking `hasattr(item, "rendermode")` to skip already-patched items. For a page with N characters across M text operations, this is O(N*M), which is effectively quadratic.

Memray profiling showed this function as the #1 allocator: 17.57 GB total across 549M allocations in a session processing just 4 files.

## Fix

Track the last-patched index so each call only processes newly-added `LTChar` objects. Reset automatically when `cur_item` changes (new page or figure).

**Before:** O(N²) per page: re-scans all accumulated objects on every text operator

**After:** O(N) per page: each object visited exactly once

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
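
## Illustrative sketch

The index-tracking idea described under "Fix" can be sketched roughly as follows. This is not the code from the commit: `Char`, `Container`, `RenderModePatcher`, and `simulate` are hypothetical stand-ins for `LTChar`, `cur_item`, and the patched method, and the `visits` counter exists only to make the amount of work visible.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Char:
    """Hypothetical, simplified stand-in for an LTChar layout object."""
    text: str
    rendermode: Optional[int] = None


@dataclass
class Container:
    """Hypothetical stand-in for cur_item (a page or figure)."""
    _objs: List[Any] = field(default_factory=list)


class RenderModePatcher:
    """Patches only objects appended since the previous call."""

    def __init__(self) -> None:
        self._last_item: Optional[Container] = None
        self._next_index = 0
        self.visits = 0  # number of objects examined, for illustration only

    def patch(self, cur_item: Container, render_mode: int) -> None:
        # A different container (new page or figure) resets the cursor.
        if cur_item is not self._last_item:
            self._last_item = cur_item
            self._next_index = 0

        objs = cur_item._objs
        # Visit only the tail that was added after the last call.
        for i in range(self._next_index, len(objs)):
            self.visits += 1
            obj = objs[i]
            if isinstance(obj, Char):
                obj.rendermode = render_mode
        self._next_index = len(objs)


def simulate(num_ops: int, chars_per_op: int) -> int:
    """Run num_ops text operators on one page and return total visits."""
    page = Container()
    patcher = RenderModePatcher()
    for op in range(num_ops):
        page._objs.extend(Char("x") for _ in range(chars_per_op))
        patcher.patch(page, render_mode=op % 2)
    return patcher.visits
```

In this sketch, `simulate(100, 10)` examines each of the 1,000 characters exactly once. A version that re-scanned the whole list on every call would examine 10 + 20 + ... + 1000 = 50,500 objects for the same input, which is where the quadratic growth comes from.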