Better element IDs - deterministic and document-unique hashes #2673
prototype solution for PDF files
e68a7f59
add basic tests for element IDs
f3f3321e
recalculate ID based on metadata (if present)
3398be3d
add more unit tests
76cbaef4
add HashValue class to identify when ID recalculation is required
272f4a61
Merge branch 'main' into CORE-3587/better-element-ids
3375f636
add test and a set of fixtures for unique and deterministic pdf eleme…
b71369c8
update hash computation so it allows for appending other data
e5d90ab3
add given when then comments
608bdbe9
add docstring
4c139de8
add html tests
02ca0929
revert unused change
0d336088
scanny
requested changes
on 2024-03-26
scanny
commented
on 2024-03-27
remove Text element tests for page_number and index_on_page
e813b896
recalculate_ids outside of the Text class
3f87ad2e
get rid of index_on_page
e2eea354
revert _id to id
d14a47ec
simplify hash calculation function
bc29126e
remove uuid.UUID from type hints for self.id
cd9b9b33
quickfix calculate_hash function call
f49c68d5
update PPTX test
eef42644
add docx test
b6e850f3
refactor ids recalculation by moving it to process_metadata decorator
7efe3a49
remove unused code
4c393f42
revert isinstance statement
8f9c4456
revert inline return statement
d33c86c5
add tests for calculating hash and recalculating ids
3becd445
don't mutate, but copy elements
183c38b8
update docs hashes
01f16c89
add doc tests
fbdaefe4
Merge branch 'main' into CORE-3587/better-element-ids
03fc1261
refactor recalculate_ids so it updates parent_id's correctly
289b1b33
rename calculate_hash into id_to_hash and make it a method
56783b5f
revert existing logic of assigning id's at construction-time
5509bb5d
remove unused code
82efb289
apply code review suggestions in tests
18124353
rename "recalculate_ids" to "assign_hash_ids"
bc35458d
remove test which is no longer relevant
cdac860e
update html test file and test itself
dfe446dd
add test_id_to_hash
1d01dfb0
handle edge case for xlsx files
93508660
update file name
dfba0fb0
revert original id in test
75f3e886
use deepcopy in test to compare if ids have changed
e179a55f
revert to construction-time UUIDs
6d93e0b3
explicit warning in assign_hash_id
baa05402
add dummy copy of id_to_hash to class "Name(EmailElement)"
d51b8c18
update hashes in tests
1e0a4a5d
micmarty-deepsense
changed the title from "[WIP] Better element IDs" to "Better element IDs" 1 year ago
adjust hash values for pptx hierarchy test
1f64d46b
remove unused file
4b5b84ca
adjust pdf hashes in a test
49d899da
Merge branch 'main' into CORE-3587/better-element-ids
24b7b5b6
update overview.rst
8911fa31
remove deprecated test
5d0ed03d
raise if element_id is not a string or NoId
6fe739a7
update CHANGELOG
135f8afc
quickfix ruff warnings
c578aa7f
quickfix changelog
dd0b9490
update __version__
ca53a97f
Better element IDs <- Ingest test fixtures update (#2832)
a6cae7b7
use hash for label studio annotations
6b2cffa3
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
fa4cb396
adjust email test
01267887
improve email element design
237636a5
fix chunking
007b7335
update the docstring for assign_hash_ids
23dbbb19
remove try except
3a6d04af
don't call id_to_uuid, elements already have UUIDs
515cb52c
move id_to_hash from Text to Element
652d6c2d
reorder methods to alphabetical order
6ed2c7ea
remove unused id_to_uuid
452e3cd7
update hashes in tests
86023f8d
Merge branch 'main' into CORE-3587/better-element-ids
3bd0745e
Better element IDs <- Ingest test fixtures update (#2839)
810dce14
remove unused imports
5a58acd9
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
a4e654f6
update hashes
298fad17
refactor one test in test_email_elements.py
339a440e
fix KeyErrors for stanley-cups
0d65b029
merge 2 tests into 1
9dbf6ae7
update pdf hashes
a2e4302d
fix label studio tests
cd3cdc4d
fix baseplate tests
2d270576
add element ID design principles section in the documentation
d1ecb40a
Better element IDs <- Ingest test fixtures update (#2840)
fd3b55a4
update Element docstrings
ebb1209e
change num of expected files in local ingest from 12 to 13
2c398e65
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
6b5dddf3
modify default behavior of Element, Text, and Name class
639423a8
quickfix id initialization in Element
ad9f58f2
move id initialization to Element
ee900960
refactor id assertions in test_elements.py
2d9a127a
quickfix bug, forgot to remove invalid assignment
f5c650ad
add changelog entry
11c4041d
adjust email tests
20d7c2fd
fix chunking
4aadf22a
remove unnecessary enumeration and remove argument to id_to_hash
45973eef
remove unused import
3f517450
quickfix support for | operator in 3.9
bc93f544
add design principles in overview.rst
e9a5dcfe
fix staging test by using deterministic hashes
f36f76c9
fix tests that were failing due to invalid text_as_html consolidation
71e70d72
add empty lines
17d6585d
quickfix typo
511cc057
micmarty-deepsense
changed the title from "Better element IDs" to "Better element IDs - unique and deterministic hashes" 1 year ago
micmarty-deepsense
changed the title from "Better element IDs - unique and deterministic hashes" to "Better element IDs - deterministic and document-unique hashes" 1 year ago
parametrize test_text_uuid
96e5b674
Preparing the ground for better element IDs <- Ingest test fixtures u…
d555736a
adjust ingestion chunking config
1a38d700
Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
920837a0
adjust ingestion chunking config
933cef02
Preparing the ground for better element IDs <- Ingest test fixtures u…
53599d39
Better element IDs - deterministic and document-unique hashes <- Inge…
9507a96e
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
39d79588
use hashes in partitioner
939f54d5
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
c4c91a56
Merge branch 'main' into CORE-3587/better-element-ids
793ef376
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
36aeefd9
remove unused import
f0e01490
Merge branch 'mike/preparing-ground-for-better-element-ids' of https:…
1c351394
move id_to_hash to interfaces.py
bec0b90a
Better element IDs - deterministic and document-unique hashes <- Inge…
b5e53bb9
ignore mongodb.sh in test-ingest-src.sh
38cd4fab
remove redundant loop with id_to_hash
ac945886
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
52e830b7
Merge branch 'main' into CORE-3587/better-element-ids
70528963
update changelog and sync version
c5b16f3c
Better element IDs - deterministic and document-unique hashes <- Inge…
af879ab0
revert ignoring mongodb.sh
82546ed1
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
79afe977
Merge branch 'main' into mike/preparing-ground-for-better-element-ids
ea6b8817
rename assign_hash_ids to assign_and_map_hash_ids
4d9dbc2b
change expected argument type for element_id in CheckBox
adb4592b
add a test utility for assigning hash ids
836e5149
more detailed element test
c0add80c
rename test
e20666d6
remove redundant line
cf622303
bump version
178bf57a
Merge branch 'mike/preparing-ground-for-better-element-ids' into CORE…
73d8edd5
update test name
170141f4
quickfix ambiguity in hash assigning function calls
3937ac4a
update CHANGELOG
3ee35ce5
remove unused import
5017e6c6
adjust hashes in test
f0def7b1
fix missing argument to id_to_hash
560d2cd4
update hash in test
bde2907f
update email tests
ba28243b
Better element IDs - deterministic and document-unique hashes <- Inge…
ca5861c8
scanny
requested changes
on 2024-04-08
fix a bug: sharing one memory address
2a9f0b88
refactor assign_and_map_hash_ids according to review suggestions
21914f2b
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
6e80fb4d
make pytest.mark.parametrize body compact
3916d0d5
add 2 example docs and adjust related tests
3250abe7
move assign_hash_ids from test_utils to unit_utils
fe7fa006
apply other minor review suggestions
66c3f237
remove unused import
8fa2666f
add pdf with duplicate page and refactor related test
aab6bad6
quickfix importing assign_hash_ids
4fd7d627
remove unused imports
90a1880b
get rid of List type
ae2cd30f
Merge branch 'main' into CORE-3587/better-element-ids
c0d1bb10
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
bdc0c3b6
remove unused imports
461f9b93
remove unused imports
624ba1a0
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
8e100f76
Better element IDs - deterministic and document-unique hashes <- Inge…
4ef4821a
remove unused argument
a3e5d60a
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
562df253
scanny
approved these changes
on 2024-04-09
scanny
requested changes
on 2024-04-10
Merge branch 'main' into CORE-3587/better-element-ids
b17c80f3
clean up after resolving conflicts
89c7f271
update hash ids for test
e2f4c3c0
use seq_on_page in hash calculation
3c07881c
support for starting_page_number in ODT files
e0b02ec7
update hashes for doc and docx tests, remove redundant assertion
cf45f7a5
include filename in hash calculation
ff2fd2f9
fix bug of sharing one metadata object by multiple elements for msg f…
eeb1ea6f
update hashes in tests and refactor them slightly
33ae279a
adjust pptx test cases
f6ec6a0c
update hashes for staging tests
9f4adede
update hashes for PDF tests
3dad5aef
fix line too long
44638309
reformat elements.py and add more comments
a93a1560
Better element IDs - deterministic and document-unique hashes <- Inge…
c8b7c66e
update changelog and version
d8e9a2f2
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
ee0392a0
update overview.rst
e578abab
make tests more compact
b9baf8a0
update html hashes
a2227fc8
remove redundant uniqueness assertion
b22784c6
update almost all hashes in spring-weather (there are still problemat…
68f93bd5
update hashes for spring-weather
f50c13c6
revert spring water example doc to original
9ac682ff
Merge branch 'main' into CORE-3587/better-element-ids
b10b3c1b
assign hash ids when doing ingestion
1178ca0a
revert all changes to test_unstructured_ingest
232c4054
Better element IDs - deterministic and document-unique hashes <- Inge…
973dc29b
increase num of expected files in local.sh
32806259
Merge branch 'CORE-3587/better-element-ids' of https://github.com/Uns…
22991c83
Merge branch 'main' into CORE-3587/better-element-ids
cc8be15d
update version
779db464
refactor 1 test in test_auto.py
af28f772
remove changelog entry duplicate
ca139021
Merge branch 'main' into CORE-3587/better-element-ids
f4fd49a4
scanny
dismissed these changes
on 2024-04-24
Merge branch 'main' into CORE-3587/better-element-ids
4a0b27d5
cragwolfe
dismissed their stale review
1 year ago
cragwolfe
approved these changes
on 2024-04-24
cragwolfe
merged
2d1923ac
into main 1 year ago
cragwolfe
deleted the CORE-3587/better-element-ids branch 1 year ago