unstructured
feat: clean pdfminer elements inside tables
#1808
Merged

benjats07
feat: clean pdf miner elements inside tables
d9dce345
benjats07 changed the title from "feat: clean pdf miner elements inside tables" to "feat: clean pdfminer elements inside tables" 2 years ago
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
01a3a064
feat: generate_extra_info param added to partition_pdf
08286a8a
linting
0177dc5d
Changelog update
a03e0d38
refactor: changes location of clean_pdfminer_inner_elements
26d7cf28
rbiseck3 Add local connector metadata and fix deserialization
4844abeb
rbiseck3 update changelog
6b8a1400
rbiseck3 move custom logic to from_dict rather than from_json
23839784
rbiseck3 Add test cases to unit test
8533005f
rbiseck3 Refactor unit test to assert entire doc equality
346ff269
rbiseck3 Don't make call to get metadata if it doesn't exist at the time a doc…
1ece28da
rbiseck3 Add unit test to validate the lack of meta on serialized content if i…
b1606422
rbiseck3 Move custom serialization down a level to to_dict()
de897260
rbiseck3 Debug ingest update CI job
8796d0f4
rbiseck3 Debug ingest update CI job
8f521190
ryannikolaidis local connector metadata and deserialization fix <- Ingest test fixtu…
f503c06f
rbiseck3 bugfix/mapping source connectors in destination cli commands (#1788)
def7c4b2
rbiseck3 update changelog
112e71a2
rbiseck3 Add new metadata to ignore in local ingest tests
ddb53210
ryannikolaidis local connector metadata and deserialization fix <- Ingest test fixtu…
18104c49
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
b7c82288
Merge remote-tracking branch 'origin/roman/local-connector-metadata' …
07c81e56
ryannikolaidis feat: clean pdfminer elements inside tables <- Ingest test fixtures u…
6a8a792e
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
b8d483c4
refactor: changes way elements are removed from pages
c11e7c90
test: add test for clean_pdfminer_inner_elements
ce4ec12f
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
bf89035a
linting
ce6a1fe9
style: add typing to clean_pdfminer_inner_elements
9cb5c87c
fix: add generate_extra_info=False to several partition_pdf calls
a3b9d8d0
test: refactor way of instantiate MockPageLayout
f5e9a247
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
149514fd
fix: add missing argument generate_extra_info
f25dbab6
ryannikolaidis feat: clean pdfminer elements inside tables <- Ingest test fixtures u…
18048c2f
benjats07 marked this pull request as ready for review 2 years ago
benjats07 requested a review from qued 2 years ago
benjats07 requested a review from badGarnet 2 years ago
benjats07 requested a review from MthwRobinson 2 years ago
benjats07 requested a review from scanny 2 years ago
benjats07 requested a review from ajjimeno 2 years ago
benjats07 enabled auto-merge 2 years ago
scanny
scanny requested changes on 2023-10-24
ajjimeno
ajjimeno commented on 2023-10-24
fix: minor issues in changelog
9fe4de49
refactor: renaming variale for creating dictionary of inner elements
657e59cd
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
2ae517e9
Linting
4975b41d
lint: delete unused imports
df011dff
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
8ccf7b50
scanny
scanny approved these changes on 2023-10-24
cragwolfe
cragwolfe requested changes on 2023-10-26
christinestraub
christinestraub requested changes on 2023-10-26
christinestraub
refactor: clean_pdfminer_inner_elements just removes elements
ba0594a8
test: update output of test
81ee8d66
refactor: removes unused argument
1825505b
fix: add error margin to is_in operation
4cc028fe
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
de1130c0
fix: Misspelled variable name
438f16ad
refactor: improvement on redability
11701768
Linting
436d4d5d
refactor: improvement on redability
c28daf81
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
0a4a4400
Changelog fixes and expected version update
b4964fb3
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
9c03c2eb
benjats07 requested a review from christinestraub 2 years ago
benjats07
benjats07 requested a review from cragwolfe 2 years ago
christinestraub
christinestraub requested changes on 2023-10-27
fix: deletes incorrect detection_origin
be4ef61d
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
412c2606
benjats07 enabled auto-merge 2 years ago
benjats07
benjats07 requested a review from christinestraub 2 years ago
christinestraub
christinestraub approved these changes on 2023-10-27
cragwolfe
cragwolfe commented on 2023-10-27
cragwolfe
cragwolfe commented on 2023-10-27
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
4b757fef
chore: removing changelog entries
68342dc4
chore: recover changelog from main and add info
21b58f94
chore: remove duplicated entry in this branch
dc95256f
chore: remove duplicated entry in this branch
3665cf63
cragwolfe
cragwolfe approved these changes on 2023-10-28
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
5fccd606
tests: added temporary fix to source checking
ad5d9673
test: update origins for tests
d10ebe8f
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
7339b84c
test: update origins for tests
7aeec3ce
Linting
470dd499
ryannikolaidis feat: clean pdfminer elements inside tables <- Ingest test fixtures u…
810d9095
benjats07 Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
d2b35305
benjats07 merged 05c3cd1b into main 2 years ago
benjats07 deleted the benjamin/feat/clean-pdfminer-inner-elements branch 2 years ago
