enhancement: Speed up function `stage_for_datasaur` by 8% (#4176)
<!-- CODEFLASH_OPTIMIZATION:
{"function":"stage_for_datasaur","file":"unstructured/staging/datasaur.py","speedup_pct":"8%","speedup_x":"0.08x","original_runtime":"1.69
milliseconds","best_runtime":"1.56
milliseconds","optimization_type":"loop","timestamp":"2025-12-20T04:34:26.272Z","version":"1.0"}
-->
#### 📄 8% (0.08x) speedup for ***`stage_for_datasaur` in
`unstructured/staging/datasaur.py`***
⏱️ Runtime : **`1.69 milliseconds`** **→** **`1.56 milliseconds`** (best
of `250` runs)
#### 📝 Explanation and details
The optimization replaces the explicit loop-based result construction
with a **list comprehension**. This change eliminates the intermediate
`result` list initialization and the repeated `append()` operations.
**Key changes:**
- Removed `result: List[Dict[str, Any]] = []` initialization
- Replaced the `for i, item in enumerate(elements):` loop with a single
list comprehension: `return [{"text": item.text, "entities":
_entities[i]} for i, item in enumerate(elements)]`
- Eliminated multiple `result.append(data)` calls
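
For illustration, the change has roughly the following shape. This is a minimal sketch, not the actual source of `unstructured/staging/datasaur.py`: the `Text` stand-in, the names `build_with_loop` and `build_with_comprehension`, and the omission of entity validation are assumptions made for the example.

```python
from typing import Any, Dict, List


class Text:
    """Hypothetical stand-in for an element that exposes a `.text` attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


def build_with_loop(
    elements: List[Text], _entities: List[List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    # Shape of the original code: grow a list with repeated append() calls.
    result: List[Dict[str, Any]] = []
    for i, item in enumerate(elements):
        data = {"text": item.text, "entities": _entities[i]}
        result.append(data)
    return result


def build_with_comprehension(
    elements: List[Text], _entities: List[List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    # Shape of the optimized code: build the same list in one comprehension.
    return [{"text": item.text, "entities": _entities[i]} for i, item in enumerate(elements)]


elements = [Text("foo bar"), Text("baz qux")]
entities = [[], [{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]

# Both forms should produce identical output for the same input.
assert build_with_loop(elements, entities) == build_with_comprehension(elements, entities)
```
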
**Why this is faster:**
In CPython, a list comprehension compiles to a tight bytecode loop that
appends through the dedicated `LIST_APPEND` instruction, so it is usually
somewhat faster than an equivalent explicit loop that calls `append()`.
The optimization eliminates the overhead of:
- Looking up the `append` attribute on `result` on every iteration
- Multiple method calls to `append()`
- Temporary variable assignment (`data`)
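
One way to inspect where the per-iteration overhead comes from is to compare the bytecode of the two forms. The sketch below is an illustration under assumptions: `loop_version`, `comprehension_version`, and `opnames` are hypothetical helpers, and the exact opcode names are CPython implementation details that differ between versions.

```python
import dis
import types
from typing import Set


def loop_version(items):
    result = []
    for x in items:
        result.append(x)  # attribute lookup plus a method call per item
    return result


def comprehension_version(items):
    return [x for x in items]  # appends via a dedicated bytecode instruction


def opnames(code: types.CodeType) -> Set[str]:
    """Collect opcode names from a code object and any nested code objects.

    Older CPython versions compile a comprehension into a nested code object,
    so the nested constants have to be walked as well.
    """
    names = {ins.opname for ins in dis.get_instructions(code)}
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names |= opnames(const)
    return names


print("comprehension uses LIST_APPEND:", "LIST_APPEND" in opnames(comprehension_version.__code__))
print("explicit loop uses LIST_APPEND:", "LIST_APPEND" in opnames(loop_version.__code__))
```
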
**Performance characteristics:**
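The per-test timings show that the gain grows with input size. The
annotated tests with 1000 elements and no entities run **18-20% faster**,
while tests with 500-1000 elements that carry entities improve by
**6-10%**, presumably because entity validation then accounts for a larger
share of the runtime. For inputs of only a few elements the difference is
mostly within a few percent in either direction, and some cases run up to
about 17% slower; these calls take roughly 1-3 μs, so the comprehension
setup cost and measurement noise dominate.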
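
A rough way to check how the difference scales with input size is a small `timeit` comparison. This is a sketch under assumptions, not the benchmark harness that produced the numbers in this report: the `Text` stand-in, `with_loop`, `with_comprehension`, and `compare` are hypothetical names, entity validation is left out, and absolute timings will vary by machine and Python version.

```python
import timeit
from typing import Any, Dict, List, Tuple


class Text:
    """Hypothetical stand-in for an element that exposes a `.text` attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


def with_loop(elements: List[Text], entities: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    result = []
    for i, item in enumerate(elements):
        result.append({"text": item.text, "entities": entities[i]})
    return result


def with_comprehension(
    elements: List[Text], entities: List[List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    return [{"text": item.text, "entities": entities[i]} for i, item in enumerate(elements)]


def compare(n: int, number: int = 200) -> Tuple[float, float]:
    """Return the best-of-5 total time (seconds) for `number` calls of each form."""
    elements = [Text(f"text_{i}") for i in range(n)]
    entities: List[List[Dict[str, Any]]] = [[] for _ in range(n)]
    loop_t = min(timeit.repeat(lambda: with_loop(elements, entities), number=number, repeat=5))
    comp_t = min(
        timeit.repeat(lambda: with_comprehension(elements, entities), number=number, repeat=5)
    )
    return loop_t, comp_t


for n in (1, 10, 1000):
    loop_t, comp_t = compare(n)
    print(f"n={n:>5}: loop {loop_t:.6f}s, comprehension {comp_t:.6f}s")
```

Taking the minimum over several repeats reduces the influence of scheduling noise, which matters most for the smallest inputs.
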
**Impact on workloads:**
This optimization will benefit any application processing substantial
amounts of text data for Datasaur formatting, particularly document
processing pipelines or batch entity annotation workflows where hundreds
or thousands of text elements are processed together.
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **6 Passed** |
| 🌀 Generated Regression Tests | ✅ **37 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | ✅ **3 Passed** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `staging/test_datasaur.py::test_datasaur_raises_with_bad_type` | 2.67μs | 2.50μs | 6.64%✅ |
| `staging/test_datasaur.py::test_datasaur_raises_with_missing_entity_text` | 1.04μs | 1.04μs | -0.096%⚠️ |
| `staging/test_datasaur.py::test_datasaur_raises_with_missing_key` | 2.08μs | 1.96μs | 6.33%✅ |
| `staging/test_datasaur.py::test_datasaur_raises_with_wrong_length` | 1.08μs | 1.04μs | 4.03%✅ |
| `staging/test_datasaur.py::test_stage_for_datasaur` | 1.29μs | 1.33μs | -3.08%⚠️ |
| `staging/test_datasaur.py::test_stage_for_datasaur_with_entities` | 2.50μs | 2.46μs | 1.67%✅ |
</details>
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
# imports
import pytest

from unstructured.staging.datasaur import stage_for_datasaur


# Mock class for Text, as per unstructured.documents.elements.Text
class Text:
    def __init__(self, text: str):
        self.text = text


# unit tests
# ---------------------------
# Basic Test Cases
# ---------------------------
def test_single_element_no_entities():
    # Single Text element, no entities
    elements = [Text("hello world")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.12μs -> 1.25μs (10.0% slower)


def test_multiple_elements_no_entities():
    # Multiple Text elements, no entities
    elements = [Text("a"), Text("b"), Text("c")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.38μs -> 1.38μs (0.000% faster)
    for i, letter in enumerate(["a", "b", "c"]):
        pass


def test_single_element_with_single_entity():
    # Single element, one entity
    elements = [Text("hello world")]
    entities = [[{"text": "hello", "type": "GREETING", "start_idx": 0, "end_idx": 5}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.04μs -> 2.04μs (0.000% faster)


def test_multiple_elements_with_entities():
    # Multiple elements, each with entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3}],
        [{"text": "qux", "type": "NOUN", "start_idx": 4, "end_idx": 7}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.50μs -> 2.58μs (3.21% slower)


def test_elements_with_mixed_entities():
    # Some elements have entities, some do not
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [[], [{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.08μs -> 2.08μs (0.000% faster)


# ---------------------------
# Edge Test Cases
# ---------------------------
def test_empty_elements_list():
    # Empty input list
    elements = []
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 791ns -> 875ns (9.60% slower)


def test_entities_length_mismatch():
    # entities list length does not match elements length
    elements = [Text("foo"), Text("bar")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 916ns -> 875ns (4.69% faster)


def test_entity_missing_key():
    # Entity is missing a required key
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0}]]  # missing 'end_idx'
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 1.83μs -> 1.75μs (4.74% faster)


def test_entity_wrong_type():
    # Entity has wrong type for a key
    elements = [Text("foo")]
    entities = [
        [{"text": "foo", "type": "NOUN", "start_idx": "0", "end_idx": 3}]
    ]  # 'start_idx' should be int
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 2.42μs -> 2.33μs (3.60% faster)


def test_entity_extra_keys():
    # Entity has extra keys (should not error)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3, "confidence": 0.99}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.00μs -> 2.04μs (2.01% slower)


def test_entities_is_none():
    # entities explicitly passed as None
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements, None)
    result = codeflash_output  # 1.04μs -> 1.08μs (3.79% slower)


def test_entity_empty_list():
    # entities is a list of empty lists (should be valid)
    elements = [Text("foo"), Text("bar")]
    entities = [[], []]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.42μs -> 1.50μs (5.60% slower)


def test_entity_text_not_matching_element():
    # Entity text does not match element text (should not error)
    elements = [Text("foobar")]
    entities = [[{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.00μs -> 2.00μs (0.000% faster)


def test_entity_indices_out_of_bounds():
    # Entity indices out of text bounds (should not error)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 10}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 2.00μs (2.10% slower)


# ---------------------------
# Large Scale Test Cases
# ---------------------------
def test_large_number_of_elements():
    # Test with 1000 elements, no entities
    n = 1000
    elements = [Text(str(i)) for i in range(n)]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 102μs -> 87.0μs (18.1% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_with_entities():
    # Test with 500 elements, each with one entity
    n = 500
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 244μs -> 227μs (7.83% faster)
    for i in range(n):
        pass


def test_large_number_of_entities_per_element():
    # Test with 10 elements, each with 100 entities
    elements = [Text(f"text_{i}") for i in range(10)]
    entities = [
        [{"text": f"t_{j}", "type": "TYPE", "start_idx": j, "end_idx": j + 1} for j in range(100)]
        for _ in range(10)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 356μs -> 337μs (5.73% faster)
    for i in range(10):
        for j in range(100):
            pass


# ---------------------------
# Mutation Testing Guards
# ---------------------------
def test_mutation_guard_wrong_text_key():
    # Changing the output key 'text' should fail
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.00μs -> 1.04μs (4.03% slower)


def test_mutation_guard_wrong_entities_key():
    # Changing the output key 'entities' should fail
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 958ns -> 1.00μs (4.20% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
# imports
import pytest

from unstructured.staging.datasaur import stage_for_datasaur


# Dummy Text class for testing, since unstructured.documents.elements.Text is not available
class Text:
    def __init__(self, text: str):
        self.text = text


# unit tests
# --------------------- Basic Test Cases ---------------------
def test_single_element_no_entities():
    # One element, no entities
    elements = [Text("hello world")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.17μs -> 1.21μs (3.47% slower)


def test_multiple_elements_no_entities():
    # Multiple elements, no entities
    elements = [Text("foo"), Text("bar"), Text("baz")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.29μs -> 1.33μs (3.15% slower)


def test_single_element_with_valid_entities():
    # One element, one valid entity
    elements = [Text("hello world")]
    entities = [[{"text": "hello", "type": "GREETING", "start_idx": 0, "end_idx": 5}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.04μs -> 2.00μs (2.05% faster)


def test_multiple_elements_with_entities():
    # Multiple elements, each with their own entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3}],
        [{"text": "qux", "type": "WORD", "start_idx": 4, "end_idx": 7}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.62μs -> 2.50μs (5.00% faster)


def test_multiple_elements_some_empty_entities():
    # Multiple elements, some with no entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [],
        [{"text": "baz", "type": "WORD", "start_idx": 0, "end_idx": 3}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.08μs -> 2.08μs (0.048% slower)


# --------------------- Edge Test Cases ---------------------
def test_empty_elements_list():
    # No elements
    elements = []
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 750ns -> 875ns (14.3% slower)


def test_empty_elements_with_empty_entities():
    # No elements, entities is empty list
    elements = []
    entities = []
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 833ns -> 1.00μs (16.7% slower)


def test_entities_length_mismatch():
    # entities list length does not match elements list length
    elements = [Text("foo"), Text("bar")]
    entities = [[]]  # Should be length 2
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 916ns -> 875ns (4.69% faster)


def test_entity_missing_key():
    # Entity dict missing a required key
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": 0}]]  # Missing 'end_idx'
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 1.92μs -> 1.75μs (9.49% faster)


def test_entity_wrong_type():
    # Entity dict with wrong type for a key
    elements = [Text("foo")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": "zero", "end_idx": 3}]
    ]  # start_idx should be int
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 2.46μs -> 2.33μs (5.36% faster)


def test_entity_extra_keys():
    # Entity dict with extra keys (should be ignored)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3, "extra": "ignored"}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 2.00μs (2.05% slower)


def test_entity_with_empty_string():
    # Entity with empty string values (should be allowed)
    elements = [Text("")]
    entities = [[{"text": "", "type": "", "start_idx": 0, "end_idx": 0}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 1.96μs (0.000% faster)


def test_entity_with_negative_indices():
    # Entity with negative indices (should be allowed, not validated)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": -1, "end_idx": -1}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.83μs -> 1.88μs (2.24% slower)


# --------------------- Large Scale Test Cases ---------------------
def test_large_number_of_elements_no_entities():
    # Large number of elements, no entities
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 103μs -> 86.7μs (19.7% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_with_entities():
    # Large number of elements, each with one entity
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 502μs -> 470μs (6.85% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_some_with_entities():
    # Large number of elements, only even indices have entities
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        (
            [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
            if i % 2 == 0
            else []
        )
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 309μs -> 282μs (9.66% faster)
    for i in range(n):
        if i % 2 == 0:
            pass
        else:
            pass


# --------------------- Determinism Test ---------------------
def test_determinism():
    # Running the function twice with the same input should yield the same result
    elements = [Text("foo"), Text("bar")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3}],
        [{"text": "bar", "type": "WORD", "start_idx": 0, "end_idx": 3}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result1 = codeflash_output  # 2.75μs -> 2.67μs (3.15% faster)
    codeflash_output = stage_for_datasaur(elements, entities)
    result2 = codeflash_output  # 1.58μs -> 1.54μs (2.66% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
import pytest

from unstructured.documents.elements import Text
from unstructured.staging.datasaur import stage_for_datasaur


def test_stage_for_datasaur():
    stage_for_datasaur(
        [
            Text(
                "",
                element_id=None,
                coordinates=None,
                coordinate_system=None,
                metadata=None,
                detection_origin="",
                embeddings=[],
            )
        ],
        entities=[[]],
    )


def test_stage_for_datasaur_2():
    with pytest.raises(
        ValueError,
        match="If\\ entities\\ is\\ specified,\\ it\\ must\\ be\\ the\\ same\\ length\\ as\\ elements\\.",
    ):
        stage_for_datasaur([], entities=[[]])


def test_stage_for_datasaur_3():
    with pytest.raises(
        ValueError,
        match="Key\\ 'text'\\ was\\ expected\\ but\\ not\\ present\\ in\\ the\\ Datasaur\\ entity\\.",
    ):
        stage_for_datasaur(
            [
                Text(
                    "",
                    element_id=None,
                    coordinates=None,
                    coordinate_system=None,
                    metadata=None,
                    detection_origin="",
                    embeddings=[0.0],
                )
            ],
            entities=[[{}, {}]],
        )
```
</details>
<details>
<summary>🔎 Concolic Coverage Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-----------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur` | 1.29μs | 1.46μs | -11.4%⚠️ |
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur_2` | 916ns | 959ns | -4.48%⚠️ |
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur_3` | 1.71μs | 1.67μs | 2.52%✅ |
</details>
To edit these changes, run
`git checkout codeflash/optimize-stage_for_datasaur-mjdt0e1s` and push.

---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>