unstructured
Better element IDs - deterministic and document-unique hashes
#2673
Merged

cragwolfe merged 201 commits into main from CORE-3587/better-element-ids
micmarty-deepsense prototype solution for PDF files
e68a7f59
micmarty-deepsense add basic tests for element IDs
f3f3321e
micmarty-deepsense recalculate ID based on metadata (if present)
3398be3d
micmarty-deepsense add more unit tests
76cbaef4
micmarty-deepsense add HashValue class to identify when ID recalculation is required
272f4a61
micmarty-deepsense commented on 2024-03-21
micmarty-deepsense commented on 2024-03-21
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
3375f636
micmarty-deepsense add test and a set of fixtures for unique and deterministic pdf eleme…
b71369c8
micmarty-deepsense update hash computation so it allows for appending other data
e5d90ab3
micmarty-deepsense add given when then comments
608bdbe9
micmarty-deepsense commented on 2024-03-26
micmarty-deepsense add docstring
4c139de8
micmarty-deepsense add html tests
02ca0929
micmarty-deepsense revert unused change
0d336088
scanny requested changes on 2024-03-26
scanny commented on 2024-03-27
micmarty-deepsense remove Text element tests for page_number and index_on_page
e813b896
micmarty-deepsense recalculate_ids outside of the Text class
3f87ad2e
micmarty-deepsense get rid of index_on_page
e2eea354
micmarty-deepsense revert _id to id
d14a47ec
micmarty-deepsense simplify hash calculation function
bc29126e
micmarty-deepsense remove uuid.UUID from type hints for self.id
cd9b9b33
micmarty-deepsense quickfix calculate_hash function call
f49c68d5
micmarty-deepsense update PPTX test
eef42644
micmarty-deepsense add docx test
b6e850f3
micmarty-deepsense refactor ids recalculation by moving it to process_metadata decorator
7efe3a49
micmarty-deepsense remove unused code
4c393f42
micmarty-deepsense force pushed from 255e9f90 to 4c393f42 1 year ago
micmarty-deepsense revert isinstance statement
8f9c4456
micmarty-deepsense revert inline return statement
d33c86c5
micmarty-deepsense add tests for calculating hash and recalculating ids
3becd445
micmarty-deepsense don't mutate, but copy elements
183c38b8
micmarty-deepsense update docs hashes
01f16c89
micmarty-deepsense add doc tests
fbdaefe4
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
03fc1261
micmarty-deepsense commented on 2024-03-28
micmarty-deepsense commented on 2024-03-28
micmarty-deepsense commented on 2024-03-28
micmarty-deepsense refactor recalculate_ids so it updates parent_id's correctly
289b1b33
micmarty-deepsense rename calculate_hash into id_to_hash and make it a method
56783b5f
micmarty-deepsense revert existing logic of assigning id's at construction-time
5509bb5d
micmarty-deepsense remove unused code
82efb289
micmarty-deepsense force pushed from b13f3817 to d42cd491 1 year ago
micmarty-deepsense force pushed from d42cd491 to 1d3ef6a1 1 year ago
micmarty-deepsense apply code review suggestions in tests
18124353
micmarty-deepsense force pushed from 1d3ef6a1 to 18124353 1 year ago
micmarty-deepsense rename "recalculate_ids" with "assign_hash_ids"
bc35458d
micmarty-deepsense remove test which is no longer relevant
cdac860e
micmarty-deepsense update html test file and test itself
dfe446dd
micmarty-deepsense add test_id_to_hash
1d01dfb0
micmarty-deepsense handle edge case for xlsx files
93508660
micmarty-deepsense update file name
dfba0fb0
micmarty-deepsense revert original id in test
75f3e886
micmarty-deepsense use deepcopy in test to compare if ids have changed
e179a55f
micmarty-deepsense revert to construction-time UUIDs
6d93e0b3
micmarty-deepsense explicit warning in assign_hash_id
baa05402
micmarty-deepsense add dummy copy of id_to_hash to class "Name(EmailElement)"
d51b8c18
micmarty-deepsense update hashes in tests
1e0a4a5d
micmarty-deepsense changed the title from "[WIP] Better element IDs" to "Better element IDs" 1 year ago
micmarty-deepsense adjust hash values for pptx hierarchy test
1f64d46b
micmarty-deepsense remove unused file
4b5b84ca
micmarty-deepsense adjust pdf hashes in a test
49d899da
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
24b7b5b6
micmarty-deepsense update overview.rst
8911fa31
micmarty-deepsense force pushed from 76e44d65 to 8911fa31 1 year ago
micmarty-deepsense remove deprecated test
5d0ed03d
micmarty-deepsense raise if element_id is not a string or NoId
6fe739a7
micmarty-deepsense force pushed from 753bd6ae to 6fe739a7 1 year ago
micmarty-deepsense commented on 2024-04-02
micmarty-deepsense commented on 2024-04-02
micmarty-deepsense marked this pull request as ready for review 1 year ago
micmarty-deepsense update CHANGELOG
135f8afc
micmarty-deepsense quickfix ruff warnings
c578aa7f
micmarty-deepsense force pushed from a5c76089 to c578aa7f 1 year ago
micmarty-deepsense quickfix changelog
dd0b9490
micmarty-deepsense update __version__
ca53a97f
ryannikolaidis Better element IDs <- Ingest test fixtures update (#2832)
a6cae7b7
micmarty-deepsense use hash for label studio annotations
6b2cffa3
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
fa4cb396
micmarty-deepsense adjust email test
01267887
micmarty-deepsense improve email element design
237636a5
micmarty-deepsense fix chunking
007b7335
micmarty-deepsense update the docstring for assign_hash_ids
23dbbb19
micmarty-deepsense remove try except
3a6d04af
micmarty-deepsense don't call id_to_uuid, elements already have UUIDs
515cb52c
micmarty-deepsense move id_to_hash from Text to Element
652d6c2d
micmarty-deepsense reorder methods to alphabetical order
6ed2c7ea
micmarty-deepsense remove unused id_to_uuid
452e3cd7
micmarty-deepsense update hashes in tests
86023f8d
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
3bd0745e
ryannikolaidis Better element IDs <- Ingest test fixtures update (#2839)
810dce14
micmarty-deepsense remove unused imports
5a58acd9
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
a4e654f6
micmarty-deepsense update hashes
298fad17
micmarty-deepsense refactor one test in test_email_elements.py
339a440e
micmarty-deepsense fix KeyErrors for stanley-cups
0d65b029
micmarty-deepsense merge 2 tests into 1
9dbf6ae7
micmarty-deepsense update pdf hashes
a2e4302d
micmarty-deepsense fix label studio tests
cd3cdc4d
micmarty-deepsense fix baseplate tests
2d270576
micmarty-deepsense add element ID design principles section in the documentation
d1ecb40a
ryannikolaidis Better element IDs <- Ingest test fixtures update (#2840)
fd3b55a4
micmarty-deepsense update Element docstrings
ebb1209e
micmarty-deepsense change num of expected files in local ingest from 12 to 13
2c398e65
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
6b5dddf3
micmarty-deepsense modify default behavior of Element, Text, and Name class
639423a8
micmarty-deepsense quickfix id initialization in Element
ad9f58f2
micmarty-deepsense move id initialization to Element
ee900960
micmarty-deepsense refactor id assertions in test_elements.py
2d9a127a
micmarty-deepsense quickfix bug, forgot to remove invalid assignment
f5c650ad
micmarty-deepsense add changelog entry
11c4041d
micmarty-deepsense adjust email tests
20d7c2fd
micmarty-deepsense fix chunking
4aadf22a
micmarty-deepsense remove unnecessary enumeration and remove argument to id_to_hash
45973eef
micmarty-deepsense remove unused import
3f517450
micmarty-deepsense quickfix support for | operand in 3.9
bc93f544
micmarty-deepsense add design principles in overview.rst
e9a5dcfe
micmarty-deepsense fix staging test by using deterministic hashes
f36f76c9
micmarty-deepsense fix tests that were failing due to invalid text_as_html consolidation
71e70d72
micmarty-deepsense add empty lines
17d6585d
micmarty-deepsense quickfix typo
511cc057
micmarty-deepsense changed the title from "Better element IDs" to "Better element IDs - unique and deterministic hashes" 1 year ago
micmarty-deepsense changed the title from "Better element IDs - unique and deterministic hashes" to "Better element IDs - deterministic and document-unique hashes" 1 year ago
micmarty-deepsense parametrize test_text_uuid
96e5b674
ryannikolaidis Preparing the ground for better element IDs <- Ingest test fixtures u…
d555736a
micmarty-deepsense adjust ingestion chunking config
1a38d700
micmarty-deepsense Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
920837a0
micmarty-deepsense adjust ingestion chunking config
933cef02
ryannikolaidis Preparing the ground for better element IDs <- Ingest test fixtures u…
53599d39
ryannikolaidis Better element IDs - deterministic and document-unique hashes <- Inge…
9507a96e
micmarty-deepsense Merge branch 'main' into mike/preparing-ground-for-better-element-ids
39d79588
micmarty-deepsense use hashes in partitioner
939f54d5
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
c4c91a56
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
793ef376
micmarty-deepsense Merge branch 'main' into mike/preparing-ground-for-better-element-ids
36aeefd9
micmarty-deepsense remove unused import
f0e01490
micmarty-deepsense Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
1c351394
micmarty-deepsense move id_to_hash to interfaces.py
bec0b90a
ryannikolaidis Better element IDs - deterministic and document-unique hashes <- Inge…
b5e53bb9
micmarty-deepsense ignore mongodb.sh in test-ingest-src.sh
38cd4fab
micmarty-deepsense remove redundant loop with id_to_hash
ac945886
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
52e830b7
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
70528963
micmarty-deepsense update changelog and sync version
c5b16f3c
ryannikolaidis Better element IDs - deterministic and document-unique hashes <- Inge…
af879ab0
micmarty-deepsense revert ignoring mongodb.sh
82546ed1
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
79afe977
micmarty-deepsense Merge branch 'main' into mike/preparing-ground-for-better-element-ids
ea6b8817
micmarty-deepsense requested a review from scanny 1 year ago
micmarty-deepsense rename assign_hash_ids to assign_and_map_hash_ids
4d9dbc2b
micmarty-deepsense change expected argument type for element_id in CheckBox
adb4592b
micmarty-deepsense add a test utility for assigning hash ids
836e5149
micmarty-deepsense more detailed element test
c0add80c
micmarty-deepsense rename test
e20666d6
micmarty-deepsense remove redundant line
cf622303
micmarty-deepsense bump version
178bf57a
micmarty-deepsense Merge branch 'mike/preparing-ground-for-better-element-ids' into CORE…
73d8edd5
micmarty-deepsense update test name
170141f4
micmarty-deepsense quickfix ambiguity in hash assigning function calls
3937ac4a
micmarty-deepsense update CHANGELOG
3ee35ce5
micmarty-deepsense remove unused import
5017e6c6
micmarty-deepsense adjust hashes in test
f0def7b1
micmarty-deepsense fix missing argument to id_to_hash
560d2cd4
micmarty-deepsense update hash in test
bde2907f
micmarty-deepsense update email tests
ba28243b
ryannikolaidis Better element IDs - deterministic and document-unique hashes <- Inge…
ca5861c8
scanny requested changes on 2024-04-08
micmarty-deepsense fix a bug: sharing one memory address
2a9f0b88
micmarty-deepsense refactor assign_and_map_hash_ids according to review suggestions
21914f2b
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
6e80fb4d
micmarty-deepsense make pytest.mark.parametrize body compact
3916d0d5
micmarty-deepsense add 2 example docs and adjust related tests
3250abe7
micmarty-deepsense move assign_hash_ids from test_utils to unit_utils
fe7fa006
micmarty-deepsense apply other minor review suggestions
66c3f237
micmarty-deepsense remove unused import
8fa2666f
micmarty-deepsense add pdf with duplicate page and refactor related test
aab6bad6
micmarty-deepsense quickfix importing assign_hash_ids
4fd7d627
micmarty-deepsense remove unused imports
90a1880b
micmarty-deepsense get rid of List type
ae2cd30f
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
c0d1bb10
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
bdc0c3b6
micmarty-deepsense remove unused imports
461f9b93
micmarty-deepsense remove unused imports
624ba1a0
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
8e100f76
ryannikolaidis Better element IDs - deterministic and document-unique hashes <- Inge…
4ef4821a
micmarty-deepsense closed this 1 year ago
micmarty-deepsense reopened this 1 year ago
micmarty-deepsense remove unused argument
a3e5d60a
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
562df253
micmarty-deepsense requested a review from scanny 1 year ago
scanny approved these changes on 2024-04-09
scanny requested changes on 2024-04-10
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
b17c80f3
micmarty-deepsense clean up after resolving conflicts
89c7f271
micmarty-deepsense force pushed from 62b927ec to 89c7f271 1 year ago
micmarty-deepsense update hash ids for test
e2f4c3c0
micmarty-deepsense use seq_on_page in hash calculation
3c07881c
micmarty-deepsense support for starting_page_number in ODT files
e0b02ec7
micmarty-deepsense update hashes for doc and docx tests, remove redundant assertion
cf45f7a5
micmarty-deepsense include filename in hash calculation
ff2fd2f9
micmarty-deepsense fix bug of sharing one metadata object by multiple elements for msg f…
eeb1ea6f
micmarty-deepsense update hashes in tests and refactor them slightly
33ae279a
micmarty-deepsense adjust pptx test cases
f6ec6a0c
micmarty-deepsense update hashes for staging tests
9f4adede
micmarty-deepsense update hashes for PDF tests
3dad5aef
micmarty-deepsense fix line too long
44638309
micmarty-deepsense reformat elements.py and add more comments
a93a1560
ryannikolaidis Better element IDs - deterministic and document-unique hashes <- Inge…
c8b7c66e
micmarty-deepsense update changelog and version
d8e9a2f2
micmarty-deepsense force pushed from 25cc25dc to 5a8c4d73 1 year ago
micmarty-deepsense force pushed from 5a8c4d73 to 865336a1 1 year ago
micmarty-deepsense force pushed from 865336a1 to 3421d871 1 year ago
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
ee0392a0
micmarty-deepsense force pushed from 3421d871 to ee0392a0 1 year ago
micmarty-deepsense update overview.rst
e578abab
micmarty-deepsense commented on 2024-04-18
micmarty-deepsense make tests more compact
b9baf8a0
micmarty-deepsense update html hashes
a2227fc8
micmarty-deepsense remove redundant uniqueness assertion
b22784c6
micmarty-deepsense update almost all hashes in spring-weather (there are still problemat…
68f93bd5
micmarty-deepsense update hashes for spring-weather
f50c13c6
micmarty-deepsense revert spring water example doc to original
9ac682ff
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
b10b3c1b
micmarty-deepsense assign hash ids when doing ingestion
1178ca0a
micmarty-deepsense revert all changes to test_unstructured_ingest
232c4054
ryannikolaidis Better element IDs - deterministic and document-unique hashes <- Inge…
973dc29b
micmarty-deepsense increase num of expected files in local.sh
32806259
micmarty-deepsense Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
22991c83
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
cc8be15d
micmarty-deepsense force pushed from f822fff7 to 9e2f2a74 1 year ago
micmarty-deepsense force pushed from 9e2f2a74 to f7a9af81 1 year ago
micmarty-deepsense update version
779db464
micmarty-deepsense force pushed from f9bdd432 to 43300f00 1 year ago
micmarty-deepsense refactor 1 test in test_auto.py
af28f772
micmarty-deepsense force pushed from 43300f00 to af28f772 1 year ago
micmarty-deepsense commented on 2024-04-22
micmarty-deepsense commented on 2024-04-22
cragwolfe commented on 2024-04-23
micmarty-deepsense remove changelog entry duplicate
ca139021
micmarty-deepsense Merge branch 'main' into CORE-3587/better-element-ids
f4fd49a4
micmarty-deepsense requested a review from scanny 1 year ago
scanny dismissed these changes on 2024-04-24
cragwolfe Merge branch 'main' into CORE-3587/better-element-ids
4a0b27d5
cragwolfe dismissed their stale review 1 year ago
remaining threads can be addressed in follow-on PRs
cragwolfe approved these changes on 2024-04-24
cragwolfe merged 2d1923ac into main 1 year ago
cragwolfe deleted the CORE-3587/better-element-ids branch 1 year ago