unstructured
feat: clean pdfminer elements inside tables
#1808
Merged
benjats07 merged 68 commits into main from benjamin/feat/clean-pdfminer-inner-elements
feat: clean pdf miner elements inside tables
d9dce345
benjats07 changed the title from "feat: clean pdf miner elements inside tables" to "feat: clean pdfminer elements inside tables" 2 years ago
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
01a3a064
feat: generate_extra_info param added to partition_pdf
08286a8a
linting
0177dc5d
Changelog update
a03e0d38
refactor: changes location of clean_pdfminer_inner_elements
26d7cf28
Add local connector metadata and fix deserialization
4844abeb
update changelog
6b8a1400
move custom logic to from_dict rather than from_json
23839784
Add test cases to unit test
8533005f
Refactor unit test to assert entire doc equality
346ff269
Don't make call to get metadata if it doesn't exist at the time a doc…
1ece28da
Add unit test to validate the lack of meta on serialized content if i…
b1606422
Move custom serialization down a level to to_dict()
de897260
Debug ingest update CI job
8796d0f4
Debug ingest update CI job
8f521190
local connector metadata and deserialization fix <- Ingest test fixtu…
f503c06f
bugfix/mapping source connectors in destination cli commands (#1788)
def7c4b2
update changelog
112e71a2
Add new metadata to ignore in local ingest tests
ddb53210
local connector metadata and deserialization fix <- Ingest test fixtu…
18104c49
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
b7c82288
Merge remote-tracking branch 'origin/roman/local-connector-metadata' …
07c81e56
feat: clean pdfminer elements inside tables <- Ingest test fixtures u…
6a8a792e
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
b8d483c4
refactor: changes way elements are removed from pages
c11e7c90
test: add test for clean_pdfminer_inner_elements
ce4ec12f
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
bf89035a
linting
ce6a1fe9
style: add typing to clean_pdfminer_inner_elements
9cb5c87c
fix: add generate_extra_info=False to several partition_pdf calls
a3b9d8d0
test: refactor way of instantiate MockPageLayout
f5e9a247
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
149514fd
fix: add missing argument generate_extra_info
f25dbab6
feat: clean pdfminer elements inside tables <- Ingest test fixtures u…
18048c2f
benjats07 marked this pull request as ready for review 2 years ago
benjats07 requested a review from qued 2 years ago
benjats07 requested a review from badGarnet 2 years ago
benjats07 requested a review from MthwRobinson 2 years ago
benjats07 requested a review from scanny 2 years ago
benjats07 requested a review from ajjimeno 2 years ago
benjats07 enabled auto-merge 2 years ago
scanny requested changes on 2023-10-24
ajjimeno commented on 2023-10-24
fix: minor issues in changelog
9fe4de49
refactor: renaming variale for creating dictionary of inner elements
657e59cd
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
2ae517e9
Linting
4975b41d
lint: delete unused imports
df011dff
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
8ccf7b50
scanny approved these changes on 2023-10-24
cragwolfe requested changes on 2023-10-26
christinestraub requested changes on 2023-10-26
refactor: clean_pdfminer_inner_elements just removes elements
ba0594a8
test: update output of test
81ee8d66
refactor: removes unused argument
1825505b
fix: add error margin to is_in operation
4cc028fe
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
de1130c0
fix: Misspelled variable name
438f16ad
refactor: improvement on redability
11701768
Linting
436d4d5d
refactor: improvement on redability
c28daf81
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
0a4a4400
Changelog fixes and expected version update
b4964fb3
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
9c03c2eb
benjats07 requested a review from christinestraub 2 years ago
benjats07 requested a review from cragwolfe 2 years ago
christinestraub requested changes on 2023-10-27
fix: deletes incorrect detection_origin
be4ef61d
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
412c2606
benjats07 enabled auto-merge 2 years ago
benjats07 requested a review from christinestraub 2 years ago
christinestraub approved these changes on 2023-10-27
cragwolfe commented on 2023-10-27
cragwolfe commented on 2023-10-27
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
4b757fef
chore: removing changelog entries
68342dc4
chore: recover changelog from main and add info
21b58f94
chore: remove duplicated entry in this branch
dc95256f
chore: remove duplicated entry in this branch
3665cf63
cragwolfe approved these changes on 2023-10-28
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
5fccd606
tests: added temporary fix to source checking
ad5d9673
test: update origins for tests
d10ebe8f
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
7339b84c
test: update origins for tests
7aeec3ce
Linting
470dd499
feat: clean pdfminer elements inside tables <- Ingest test fixtures u…
810d9095
Merge branch 'main' into benjamin/feat/clean-pdfminer-inner-elements
d2b35305
benjats07 merged 05c3cd1b into main 2 years ago
benjats07 deleted the benjamin/feat/clean-pdfminer-inner-elements branch 2 years ago
Reviewers: cragwolfe, christinestraub, scanny, ajjimeno, qued, badGarnet, MthwRobinson
Assignees: No one assigned
Labels: None yet
Milestone: No milestone