unstructured
feat: support pdf link extraction in hi_res strategy
#3753

Merged

christinestraub merged 27 commits into main from feat/support-link-extraction-in-pdf-hi_res

refactor: organize PDF link extraction functions in `fast` strategy

e713ef65

move url_metadata related functions from pdf.py to pdfminer_processin…

c010c206

add ability to get urls_metadata from process_data_with_pdfminer

ef718e20

return layouts_urls_metadata in process_file_with_pdfminer()

7a863bff

add parameter `layouts_urls_metadata` to `process_file_with_pdfminer()`

485827e0

update logic to get layouts url metadata

7223a132

add elements links using _get_links_in_element()

881d6fcf

christinestraub force pushed from 48d0233c to 881d6fcf 1 year ago

Merge branch 'main' into feat/support-link-extraction-in-pdf-hi_res

12b53e5f

test: fix lint errors

18d670cd

update changelog.md and version.py

abfd8036

remove unnecessary `map` function

6b443703

fix import error

7e8ad86a

test: fix lint errors

9a3b59ff

fix list index out of range

2ad76f8e

fix: list index out of range

0445006a

test: add unit test for hi_res link extraction

a656b226

refactor: move document_to_element_list from common module to pdf module

406b9d3e

feat: support pdf link extraction in hi_res strategy <- Ingest test f…

9cf2f814

christinestraub marked this pull request as ready for review 1 year ago

christinestraub requested a review from badGarnet 1 year ago

christinestraub requested a review from cragwolfe 1 year ago

cragwolfe commented on 2024-10-29

chore: release version 0.16.4

c2cfd669

refactor: rename `get_word_bounding_box_from_element()` to `get_words…

d2332ca9

feat: enhance word extraction from PDFMiner objects

6fa0c09d

feat: support pdf link extraction in hi_res strategy <- Ingest test f…

c5772f31

christinestraub requested a review from cragwolfe 1 year ago

ci: add envs for astradb credentials to ingest-test-fixtures-update-p…

89c97d21

feat: support pdf link extraction in hi_res strategy <- Ingest test f…

7cc86ba0

christinestraub requested a review from ryannikolaidis 1 year ago

Merge branch 'main' into feat/support-link-extraction-in-pdf-hi_res

909af5ac

deps: pin unstructured-ingest to 0.2.1

ae94895f

christinestraub enabled auto-merge 1 year ago

Merge branch 'refs/heads/main' into feat/support-link-extraction-in-p…

29fe122c

cragwolfe approved these changes on 2024-10-31

christinestraub merged df156ebe into main 1 year ago

christinestraub deleted the feat/support-link-extraction-in-pdf-hi_res branch 1 year ago

Reviewers

cragwolfe

badGarnet

ryannikolaidis

