unstructured
refactor: `partition_pdf()` for `ocr_only` strategy
#1811
Merged

christinestraub
christinestraub feat: update `ocr_only` strategy related code using `process_file_wit…
4cdc559f
christinestraub feat: add functionality to get layout elements from ocr regions (`ocr…
d9c5f68f
christinestraub feat: update `merge_out_layout_with_ocr_layout()` to perform grouping…
9eb10480
christinestraub refactor: renaming...
3e111fbb
christinestraub refactor: organization
b59ff5f6
christinestraub refactor: revert renaming
8bf8b4ba
christinestraub refactor: combine `get_ocr_layout_from_image()` and `get_ocr_text_fro…
c1a4d571
christinestraub feat:
e49ab0e6
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
2e2c4a19
christinestraub feat: add an `Enum` for OCR sources
babd2ba1
christinestraub feat: add functionality to get layout elements from ocr regions for `…
d3cbbe06
christinestraub feat: add functionality to get `source` when merging text regions
49ef5553
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
a5af33be
christinestraub refactor: minor changes
2b70b080
christinestraub refactor: separate `ocr_only` path from `hi_res` path
bee455f8
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
51bcd4e9
christinestraub refactor: update `get_ocr_data_from_image` to reflect changes in the …
30e7661a
christinestraub test: fix lint errors
9e97cdc9
christinestraub refactor: rename `entire_page_ocr` to `ocr_agent`
ed7c5998
christinestraub chore: update required dependencies for `_partition_pdf_or_image_with…
d566de7b
christinestraub feat: add constants for OCR agents
386b8a77
christinestraub refactor: update ocr test cases
e026a960
christinestraub chore: update changelog & version
7fa71cc3
christinestraub marked this pull request as ready for review 2 years ago
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
38dc3a41
christinestraub requested a review from yuming-long 2 years ago
christinestraub requested a review from qued 2 years ago
christinestraub requested a review from cragwolfe 2 years ago
christinestraub chore: update changelog
beb1980a
yuming-long commented on 2023-10-24
yuming-long commented on 2023-10-24
yuming-long commented on 2023-10-24
christinestraub test: fix unit test errors
8ec73c59
christinestraub feat: set `languages` metadata field
3cf72eb9
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
3988b3cd
christinestraub chore: add docstring to `get_ocr_data_from_image()`
e327db72
christinestraub feat: revert setting `languages` metadata field for `hi_res` strategy
34fe9641
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
d21b28ea
christinestraub chore: fix lint errors
4c7ef536
christinestraub requested a review from yuming-long 2 years ago
christinestraub chore: add notes to `get_page_layout_from_ocr()`
3b57a384
christinestraub test: fix unit test errors
546e6186
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
8e70433d
christinestraub chore: update version
71468700
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
1f6ec4b1
ryannikolaidis refactor: `partition_pdf()` for `ocr_only` strategy <- Ingest test fi…
d258bd54
yuming-long commented on 2023-10-25
qued commented on 2023-10-25
qued commented on 2023-10-25
qued commented on 2023-10-25
qued dismissed these changes on 2023-10-25
christinestraub feat: update page layout element type by `element_from_text()`
85f8d42a
christinestraub feat: disable sorting for `tesseract`
7d8edaaa
christinestraub feat: update natural reading order evaluation script to skip drawing …
fcd55eca
christinestraub refactor: reduce dependency on `unstructured-inference` format
82eca62e
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
d8c51181
christinestraub chore: update version
a306a130
christinestraub chore: update changelog & version
ebb7adfe
christinestraub feat: move `OCR_AGENT` to environment config & add utility function `…
d519ead6
christinestraub test: update test case
22af55b6
ryannikolaidis refactor: `partition_pdf()` for `ocr_only` strategy <- Ingest test fi…
1446d27a
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
edbe55c9
christinestraub chore: update version
22af35e2
ryannikolaidis refactor: `partition_pdf()` for `ocr_only` strategy <- Ingest test fi…
8ab28481
christinestraub requested a review from yuming-long 2 years ago
yuming-long approved these changes on 2023-10-26
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
3e397069
christinestraub chore: update changelog & version
03f4bb2a
christinestraub refactor: revert combining `get_ocr_layout_from_image()` and `get_ocr…
ad30d5fd
christinestraub feat: remove bad `detection_origin`
d49d935c
christinestraub test: add test cases
9c5ee2fc
christinestraub requested a review from qued 2 years ago
christinestraub refactor: renaming...
fa294e85
christinestraub feat: update `_ocr_data_to_elements` to return "UncategorizedText" el…
c0259267
christinestraub refactor: merge test functions
99234342
christinestraub test: update test cases
93b67f55
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
023588a3
yuming-long commented on 2023-10-27
yuming-long commented on 2023-10-27
christinestraub test: add test cases for xycut.py
90e9f732
christinestraub refactor: remove unused functions
719c0a59
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
bc340034
christinestraub chore: update version
80bfe471
christinestraub test: fix lint errors
47bce397
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
13f4967d
christinestraub chore: update changelog & version
36477ec3
christinestraub chore: update log messages
a476c7b9
christinestraub refactor: renaming...
be5c79ec
christinestraub enabled auto-merge 2 years ago
disabled auto-merge 2 years ago
Manually disabled by user
christinestraub
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
f385bb87
christinestraub Merge branch 'main' into refactor/partition-pdf-ocr_only
4b75f41c
christinestraub refactor: merge multiple test functions related to "ocr_only" strateg…
026028ae
christinestraub test: update test functions
fc0e9ab1
christinestraub chore: update changelog & version
16b3b125
christinestraub test: update test function
4beca658
christinestraub chore: fix version
6f4fc014
christinestraub test: fix lint errors
a5f3ba3c
cragwolfe approved these changes on 2023-10-30
christinestraub dismissed their stale review 2 years ago
already addressed
christinestraub merged 1f0c563e into main 2 years ago
christinestraub deleted the refactor/partition-pdf-ocr_only branch 2 years ago