⚡️ Speed up method `OCRAgentTesseract.extract_word_from_hocr` by 35% (#4130)
Saurabh's comments: This looks like a good, straightforward, and
impactful optimization.
<!-- CODEFLASH_OPTIMIZATION:
{"function":"OCRAgentTesseract.extract_word_from_hocr","file":"unstructured/partition/utils/ocr_models/tesseract_ocr.py","speedup_pct":"35%","speedup_x":"0.35x","original_runtime":"7.18
milliseconds","best_runtime":"5.31
milliseconds","optimization_type":"loop","timestamp":"2025-12-19T03:15:54.368Z","version":"1.0"}
-->
#### 📄 35% (0.35x) speedup for
***`OCRAgentTesseract.extract_word_from_hocr` in
`unstructured/partition/utils/ocr_models/tesseract_ocr.py`***
⏱️ Runtime : **`7.18 milliseconds`** **→** **`5.31 milliseconds`** (best
of `13` runs)
#### 📝 Explanation and details
The optimized code achieves a **35% speedup** through two key
performance improvements:
**1. Regex Precompilation**
The original code calls `re.search(r"x_conf (\d+\.\d+)", char_title)`
inside the loop. Python's `re` module caches compiled patterns, so the
pattern is not recompiled on every iteration, but each call still pays
for the cache lookup and an extra function-call layer. The optimization
moves the pattern to module level as `_RE_X_CONF =
re.compile(r"x_conf (\d+\.\d+)")`, compiling it once at import time so
the loop can call the compiled pattern's `search` method directly. The
line profiler shows the regex search time improved from 12.73ms (42.9%
of total time) to 3.02ms (16.2% of total time), a **76% reduction** in
regex overhead.
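
A minimal sketch of the precompilation pattern described above. Only `_RE_X_CONF` and its pattern come from this description; the helper name `parse_x_conf` and the sample title strings are hypothetical, chosen for illustration, and this is not the actual body of `extract_word_from_hocr`.

```python
import re
from typing import Optional

# Compiled once at import time, as described above.
_RE_X_CONF = re.compile(r"x_conf (\d+\.\d+)")


def parse_x_conf(char_title: str) -> Optional[float]:
    """Hypothetical helper: return the x_conf value from a title string.

    Returns None when the title carries no decimal x_conf value.
    """
    match = _RE_X_CONF.search(char_title)
    return float(match.group(1)) if match else None


# Illustrative inputs only; real hOCR titles may carry other fields.
with_conf = parse_x_conf("x_bboxes 10 20 30 40; x_conf 98.5")
without_conf = parse_x_conf("x_bboxes 10 20 30 40")
```

Calling `search` on the compiled object skips the per-call pattern lookup that `re.search` performs, which is where the saving in a hot loop comes from.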
**2. Efficient String Building**
The original code uses string concatenation (`word_text += char`), which
can create a new string object on each addition because Python strings
are immutable. With 6,339 character additions in the profiled run, this
can become expensive. The optimization collects characters in a list
(`chars.append(char)`) and builds the final string once with
`"".join(chars)`. In the line profile, accumulation took 1.52ms with
concatenation versus 1.58ms for the appends plus a single 46μs join, so
this change is roughly neutral in the profiled run; the measured gain
comes mainly from the regex change.
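
A minimal sketch of the list-and-join accumulation, assuming characters arrive as `(char, confidence)` pairs. The function name `join_confident_chars`, the pair format, and the threshold semantics are assumptions made for illustration; the real method reads characters from hOCR spans.

```python
from typing import Iterable, List, Tuple


def join_confident_chars(
    chars_with_conf: Iterable[Tuple[str, float]], threshold: float
) -> str:
    """Hypothetical helper: keep characters whose confidence meets the threshold."""
    chars: List[str] = []
    for char, conf in chars_with_conf:
        # Skip empty characters and those below the confidence threshold.
        if char and conf >= threshold:
            chars.append(char)
    # Build the final string once instead of concatenating per character.
    return "".join(chars)


word = join_confident_chars([("c", 0.9), ("a", 0.2), ("t", 0.8)], 0.5)
```

The join allocates the result string once, so the cost stays linear in the number of characters regardless of how the interpreter handles repeated `+=`.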
**Performance Impact**
These optimizations are particularly effective for OCR processing where:
- The same regex pattern is applied thousands of times per document
- Words contain multiple characters that need accumulation
- The function is likely called frequently during document processing
The 35% speedup directly translates to faster document processing in OCR
workflows, with the most significant gains occurring when processing
documents with many detected characters that pass the confidence
threshold.
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **27 Passed** |
| 🌀 Generated Regression Tests | ✅ **22 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/pdf_image/test_ocr.py::test_extract_word_from_hocr` | 63.2μs | 49.1μs | 28.7%✅ |
</details>
To edit these changes, run `git checkout
codeflash/optimize-OCRAgentTesseract.extract_word_from_hocr-mjcarjk8`
and push.

---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>