feat: support pdf link extraction in hi_res strategy #3753
refactor: organize PDF link extraction functions in `fast` strategy
e713ef65
move url_metadata related functions from pdf.py to pdfminer_processin…
c010c206
add ability to get urls_metadata from process_data_with_pdfminer
ef718e20
return layouts_urls_metadata in process_file_with_pdfminer()
7a863bff
add parameter `layouts_urls_metadata` to `process_file_with_pdfminer()`
485827e0
update logic to get layouts url metadata
7223a132
add elements links using _get_links_in_element()
881d6fcf
Merge branch 'main' into feat/support-link-extraction-in-pdf-hi_res
12b53e5f
test: fix lint errors
18d670cd
update changelog.md and version.py
abfd8036
remove unnecessary `map` function
6b443703
fix import error
7e8ad86a
test: fix lint errors
9a3b59ff
fix list index out of range
2ad76f8e
fix: list index out of range
0445006a
test: add unit test for hi_res link extraction
a656b226
refactor: move document_to_element_list from common module to pdf module
406b9d3e
feat: support pdf link extraction in hi_res strategy <- Ingest test f…
9cf2f814
chore: release version 0.16.4
c2cfd669
refactor: rename `get_word_bounding_box_from_element()` to `get_words…
d2332ca9
feat: enhance word extraction from PDFMiner objects
6fa0c09d
feat: support pdf link extraction in hi_res strategy <- Ingest test f…
c5772f31
ci: add envs for astradb credentials to ingest-test-fixtures-update-p…
89c97d21
feat: support pdf link extraction in hi_res strategy <- Ingest test f…
7cc86ba0
Merge branch 'main' into feat/support-link-extraction-in-pdf-hi_res
909af5ac
deps: pin unstructured-ingest to 0.2.1
ae94895f
Merge branch 'refs/heads/main' into feat/support-link-extraction-in-p…
29fe122c
cragwolfe
approved these changes
on 2024-10-31
christinestraub
deleted the feat/support-link-extraction-in-pdf-hi_res branch 1 year ago