unstructured
402c3ab9 - enhancement: Speed up function `stage_for_datasaur` by 8% (#4176)

Commit

148 days ago

enhancement: Speed up function `stage_for_datasaur` by 8% (#4176)

#### 📄 8% (0.08x) speedup for ***`stage_for_datasaur` in `unstructured/staging/datasaur.py`***

⏱️ Runtime : **`1.69 milliseconds`** **→** **`1.56 milliseconds`** (best of `250` runs)

#### 📝 Explanation and details

The optimization replaces the explicit loop-based result construction with a **list comprehension**. This removes the separate `result` list initialization and the repeated `append()` calls.

**Key changes:**
- Removed the `result: List[Dict[str, Any]] = []` initialization
- Replaced the `for i, item in enumerate(elements):` loop with a single list comprehension: `return [{"text": item.text, "entities": _entities[i]} for i, item in enumerate(elements)]`
- Eliminated the `result.append(data)` calls

**Why this is faster:**
In CPython, a list comprehension appends each item with a dedicated bytecode instruction instead of looking up and calling the `append` method on every iteration, so it usually runs faster than the equivalent explicit loop. The optimization removes the overhead of:
- An attribute lookup and a method call to `append()` per element
- The temporary variable assignment (`data`) per element

**Performance characteristics:**
The gain grows with input size. The annotated tests show an **18-20% speedup** for 1000 elements without entities, while small inputs see modest gains or a slight slowdown from the comprehension's fixed setup cost. Larger workloads with entities (500 to 1000 elements) improve by roughly **6-10%**, since entity validation accounts for much of their runtime.

**Impact on workloads:**
This optimization benefits applications that format substantial amounts of text for Datasaur, particularly document processing pipelines or batch entity annotation workflows where hundreds or thousands of text elements are processed together.
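The change described above can be sketched in isolation as follows. This is a minimal illustration, not the code in `unstructured/staging/datasaur.py`: the `Text` class is a stand-in for `unstructured.documents.elements.Text`, and `build_with_loop`, `build_with_comprehension`, and `compare` are hypothetical names introduced here. Entity validation, which the real function performs, is left out.

```python
import timeit
from typing import Any, Dict, List, Tuple


class Text:
    """Hypothetical stand-in for unstructured.documents.elements.Text."""

    def __init__(self, text: str) -> None:
        self.text = text


def build_with_loop(
    elements: List[Text], entities: List[List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    # Shape of the original code: explicit loop, temporary dict, append().
    result: List[Dict[str, Any]] = []
    for i, item in enumerate(elements):
        data = {"text": item.text, "entities": entities[i]}
        result.append(data)
    return result


def build_with_comprehension(
    elements: List[Text], entities: List[List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    # Shape of the optimized code: one list comprehension, no append() calls.
    return [
        {"text": item.text, "entities": entities[i]}
        for i, item in enumerate(elements)
    ]


def compare(n: int = 1000, number: int = 200, repeat: int = 5) -> Tuple[float, float]:
    """Return the best-of-`repeat` total time in seconds for each variant."""
    elements = [Text(f"text_{i}") for i in range(n)]
    entities: List[List[Dict[str, Any]]] = [[] for _ in range(n)]
    loop_time = min(
        timeit.repeat(lambda: build_with_loop(elements, entities),
                      number=number, repeat=repeat)
    )
    comp_time = min(
        timeit.repeat(lambda: build_with_comprehension(elements, entities),
                      number=number, repeat=repeat)
    )
    return loop_time, comp_time


if __name__ == "__main__":
    loop_time, comp_time = compare()
    print(f"loop: {loop_time:.4f}s  comprehension: {comp_time:.4f}s")
```

Both variants return the same list, so the difference is purely in per-element overhead. Timings depend on the interpreter version and machine, so `compare` should be read as a way to reproduce the measurement locally rather than as a confirmation of the figures reported in this commit.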
✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **6 Passed** |
| 🌀 Generated Regression Tests | ✅ **37 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | ✅ **3 Passed** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `staging/test_datasaur.py::test_datasaur_raises_with_bad_type` | 2.67μs | 2.50μs | 6.64%✅ |
| `staging/test_datasaur.py::test_datasaur_raises_with_missing_entity_text` | 1.04μs | 1.04μs | -0.096%⚠️ |
| `staging/test_datasaur.py::test_datasaur_raises_with_missing_key` | 2.08μs | 1.96μs | 6.33%✅ |
| `staging/test_datasaur.py::test_datasaur_raises_with_wrong_length` | 1.08μs | 1.04μs | 4.03%✅ |
| `staging/test_datasaur.py::test_stage_for_datasaur` | 1.29μs | 1.33μs | -3.08%⚠️ |
| `staging/test_datasaur.py::test_stage_for_datasaur_with_entities` | 2.50μs | 2.46μs | 1.67%✅ |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
# imports
import pytest
from unstructured.staging.datasaur import stage_for_datasaur


# Mock class for Text, as per unstructured.documents.elements.Text
class Text:
    def __init__(self, text: str):
        self.text = text


# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------


def test_single_element_no_entities():
    # Single Text element, no entities
    elements = [Text("hello world")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.12μs -> 1.25μs (10.0% slower)


def test_multiple_elements_no_entities():
    # Multiple Text elements, no entities
    elements = [Text("a"), Text("b"), Text("c")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.38μs -> 1.38μs (0.000% faster)
    for i, letter in enumerate(["a", "b", "c"]):
        pass


def test_single_element_with_single_entity():
    # Single element, one entity
    elements = [Text("hello world")]
    entities = [[{"text": "hello", "type": "GREETING", "start_idx": 0, "end_idx": 5}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.04μs -> 2.04μs (0.000% faster)


def test_multiple_elements_with_entities():
    # Multiple elements, each with entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3}],
        [{"text": "qux", "type": "NOUN", "start_idx": 4, "end_idx": 7}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.50μs -> 2.58μs (3.21% slower)


def test_elements_with_mixed_entities():
    # Some elements have entities, some do not
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [[], [{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.08μs -> 2.08μs (0.000% faster)


# ---------------------------
# Edge Test Cases
# ---------------------------


def test_empty_elements_list():
    # Empty input list
    elements = []
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 791ns -> 875ns (9.60% slower)


def test_entities_length_mismatch():
    # entities list length does not match elements length
    elements = [Text("foo"), Text("bar")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 916ns -> 875ns (4.69% faster)


def test_entity_missing_key():
    # Entity is missing a required key
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0}]]  # missing 'end_idx'
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 1.83μs -> 1.75μs (4.74% faster)


def test_entity_wrong_type():
    # Entity has wrong type for a key
    elements = [Text("foo")]
    entities = [
        [{"text": "foo", "type": "NOUN", "start_idx": "0", "end_idx": 3}]
    ]  # 'start_idx' should be int
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 2.42μs -> 2.33μs (3.60% faster)


def test_entity_extra_keys():
    # Entity has extra keys (should not error)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3, "confidence": 0.99}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.00μs -> 2.04μs (2.01% slower)


def test_entities_is_none():
    # entities explicitly passed as None
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements, None)
    result = codeflash_output  # 1.04μs -> 1.08μs (3.79% slower)


def test_entity_empty_list():
    # entities is a list of empty lists (should be valid)
    elements = [Text("foo"), Text("bar")]
    entities = [[], []]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.42μs -> 1.50μs (5.60% slower)


def test_entity_text_not_matching_element():
    # Entity text does not match element text (should not error)
    elements = [Text("foobar")]
    entities = [[{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.00μs -> 2.00μs (0.000% faster)


def test_entity_indices_out_of_bounds():
    # Entity indices out of text bounds (should not error)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 10}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 2.00μs (2.10% slower)


# ---------------------------
# Large Scale Test Cases
# ---------------------------


def test_large_number_of_elements():
    # Test with 1000 elements, no entities
    n = 1000
    elements = [Text(str(i)) for i in range(n)]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 102μs -> 87.0μs (18.1% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_with_entities():
    # Test with 500 elements, each with one entity
    n = 500
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 244μs -> 227μs (7.83% faster)
    for i in range(n):
        pass


def test_large_number_of_entities_per_element():
    # Test with 10 elements, each with 100 entities
    elements = [Text(f"text_{i}") for i in range(10)]
    entities = [
        [{"text": f"t_{j}", "type": "TYPE", "start_idx": j, "end_idx": j + 1} for j in range(100)]
        for _ in range(10)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 356μs -> 337μs (5.73% faster)
    for i in range(10):
        for j in range(100):
            pass


# ---------------------------
# Mutation Testing Guards
# ---------------------------


def test_mutation_guard_wrong_text_key():
    # Changing the output key 'text' should fail
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.00μs -> 1.04μs (4.03% slower)


def test_mutation_guard_wrong_entities_key():
    # Changing the output key 'entities' should fail
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 958ns -> 1.00μs (4.20% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
# imports
import pytest
from unstructured.staging.datasaur import stage_for_datasaur


# Dummy Text class for testing, since unstructured.documents.elements.Text is not available
class Text:
    def __init__(self, text: str):
        self.text = text


# unit tests

# --------------------- Basic Test Cases ---------------------


def test_single_element_no_entities():
    # One element, no entities
    elements = [Text("hello world")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.17μs -> 1.21μs (3.47% slower)


def test_multiple_elements_no_entities():
    # Multiple elements, no entities
    elements = [Text("foo"), Text("bar"), Text("baz")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.29μs -> 1.33μs (3.15% slower)


def test_single_element_with_valid_entities():
    # One element, one valid entity
    elements = [Text("hello world")]
    entities = [[{"text": "hello", "type": "GREETING", "start_idx": 0, "end_idx": 5}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.04μs -> 2.00μs (2.05% faster)


def test_multiple_elements_with_entities():
    # Multiple elements, each with their own entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3}],
        [{"text": "qux", "type": "WORD", "start_idx": 4, "end_idx": 7}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.62μs -> 2.50μs (5.00% faster)


def test_multiple_elements_some_empty_entities():
    # Multiple elements, some with no entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [],
        [{"text": "baz", "type": "WORD", "start_idx": 0, "end_idx": 3}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.08μs -> 2.08μs (0.048% slower)


# --------------------- Edge Test Cases ---------------------


def test_empty_elements_list():
    # No elements
    elements = []
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 750ns -> 875ns (14.3% slower)


def test_empty_elements_with_empty_entities():
    # No elements, entities is empty list
    elements = []
    entities = []
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 833ns -> 1.00μs (16.7% slower)


def test_entities_length_mismatch():
    # entities list length does not match elements list length
    elements = [Text("foo"), Text("bar")]
    entities = [[]]  # Should be length 2
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 916ns -> 875ns (4.69% faster)


def test_entity_missing_key():
    # Entity dict missing a required key
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": 0}]]  # Missing 'end_idx'
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 1.92μs -> 1.75μs (9.49% faster)


def test_entity_wrong_type():
    # Entity dict with wrong type for a key
    elements = [Text("foo")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": "zero", "end_idx": 3}]
    ]  # start_idx should be int
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 2.46μs -> 2.33μs (5.36% faster)


def test_entity_extra_keys():
    # Entity dict with extra keys (should be ignored)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3, "extra": "ignored"}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 2.00μs (2.05% slower)


def test_entity_with_empty_string():
    # Entity with empty string values (should be allowed)
    elements = [Text("")]
    entities = [[{"text": "", "type": "", "start_idx": 0, "end_idx": 0}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 1.96μs (0.000% faster)


def test_entity_with_negative_indices():
    # Entity with negative indices (should be allowed, not validated)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": -1, "end_idx": -1}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.83μs -> 1.88μs (2.24% slower)


# --------------------- Large Scale Test Cases ---------------------


def test_large_number_of_elements_no_entities():
    # Large number of elements, no entities
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 103μs -> 86.7μs (19.7% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_with_entities():
    # Large number of elements, each with one entity
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 502μs -> 470μs (6.85% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_some_with_entities():
    # Large number of elements, only even indices have entities
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        (
            [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
            if i % 2 == 0
            else []
        )
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 309μs -> 282μs (9.66% faster)
    for i in range(n):
        if i % 2 == 0:
            pass
        else:
            pass


# --------------------- Determinism Test ---------------------


def test_determinism():
    # Running the function twice with the same input should yield the same result
    elements = [Text("foo"), Text("bar")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3}],
        [{"text": "bar", "type": "WORD", "start_idx": 0, "end_idx": 3}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result1 = codeflash_output  # 2.75μs -> 2.67μs (3.15% faster)
    codeflash_output = stage_for_datasaur(elements, entities)
    result2 = codeflash_output  # 1.58μs -> 1.54μs (2.66% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
import pytest
from unstructured.documents.elements import Text
from unstructured.staging.datasaur import stage_for_datasaur


def test_stage_for_datasaur():
    stage_for_datasaur(
        [
            Text(
                "",
                element_id=None,
                coordinates=None,
                coordinate_system=None,
                metadata=None,
                detection_origin="",
                embeddings=[],
            )
        ],
        entities=[[]],
    )


def test_stage_for_datasaur_2():
    with pytest.raises(
        ValueError,
        match="If\\ entities\\ is\\ specified,\\ it\\ must\\ be\\ the\\ same\\ length\\ as\\ elements\\.",
    ):
        stage_for_datasaur([], entities=[[]])


def test_stage_for_datasaur_3():
    with pytest.raises(
        ValueError,
        match="Key\\ 'text'\\ was\\ expected\\ but\\ not\\ present\\ in\\ the\\ Datasaur\\ entity\\.",
    ):
        stage_for_datasaur(
            [
                Text(
                    "",
                    element_id=None,
                    coordinates=None,
                    coordinate_system=None,
                    metadata=None,
                    detection_origin="",
                    embeddings=[0.0],
                )
            ],
            entities=[[{}, {}]],
        )
```

</details>

<details>
<summary>🔎 Concolic Coverage Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-----------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur` | 1.29μs | 1.46μs | -11.4%⚠️ |
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur_2` | 916ns | 959ns | -4.48%⚠️ |
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur_3` | 1.71μs | 1.67μs | 2.52%✅ |

</details>

To edit these changes `git checkout codeflash/optimize-stage_for_datasaur-mjdt0e1s` and push.
[Optimized with Codeflash](https://codeflash.ai)

![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>

References

#4176 - ⚡️ Speed up function `stage_for_datasaur` by 8%

Author

aseembits93

Parents

e670864b
