unstructured
0b73978b - fix: fix `IndexError` when partitioning a pdf with `starting_page_number` (#3246)

Commit
1 year ago
fix: fix `IndexError` when partitioning a pdf with `starting_page_number` (#3246)

The Issue:
When extracting images from PDFs, we use the metadata page number to index into a list of the images. However, the metadata page number can now be changed via `starting_page_number`, so it no longer matches the position in that list. To get the true page index, we need to subtract this value.

Testing:
Run this snippet in a Python shell. Before the fix, it throws an `IndexError`. On this branch, it returns the elements.

```
from unstructured.partition.auto import partition

filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
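
The index correction described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code from the commit: `page_index` and `page_images` are hypothetical names, and the sketch assumes the reported page number of the first page equals `starting_page_number` (default 1).

```python
def page_index(metadata_page_number: int, starting_page_number: int = 1) -> int:
    """Map a reported page number back to a zero-based index.

    Hypothetical helper: assumes the first page of the document is
    reported as `starting_page_number`.
    """
    index = metadata_page_number - starting_page_number
    if index < 0:
        raise ValueError("page number is smaller than starting_page_number")
    return index


# Stand-in for the per-page image list (one entry per PDF page).
page_images = ["page-0.jpg", "page-1.jpg", "page-2.jpg"]

starting_page_number = 20
reported_page_number = 21  # second page of the document

# Indexing with the reported number directly would be out of range here;
# subtracting the offset recovers the real position in the list.
image = page_images[page_index(reported_page_number, starting_page_number)]
```

With the default `starting_page_number` of 1, the helper reduces to the usual `page_number - 1` conversion, so existing behaviour would be unchanged under this assumption.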