unstructured
1f32cdaa - ⚡️ Speed up method `OCRAgentTesseract.extract_word_from_hocr` by 35% (#4130)

Commit

103 days ago

⚡️ Speed up method `OCRAgentTesseract.extract_word_from_hocr` by 35% (#4130)

Saurabh's comments - This looks like a good, easy, straightforward, and impactful optimization.

#### 📄 35% (0.35x) speedup for ***`OCRAgentTesseract.extract_word_from_hocr` in `unstructured/partition/utils/ocr_models/tesseract_ocr.py`***

⏱️ Runtime : **`7.18 milliseconds`** **→** **`5.31 milliseconds`** (best of `13` runs)

#### 📝 Explanation and details

The optimized code achieves a **35% speedup** through two changes:

**1. Regex Precompilation**

The original code calls `re.search(r"x_conf (\d+\.\d+)", char_title)` inside the loop, so every iteration passes the pattern string back through the `re` module. Python caches compiled patterns, so this is a per-call cache lookup and dispatch rather than a full recompile, but the cost still adds up in a hot loop. The optimization moves the pattern to module level as `_RE_X_CONF = re.compile(r"x_conf (\d+\.\d+)")`, compiling it once at import time. The line profiler shows the regex search time improved from 12.73ms (42.9% of total time) to 3.02ms (16.2% of total time), a **76% reduction** in regex overhead.

**2. Efficient String Building**

The original code uses string concatenation (`word_text += char`), which can create a new string object on each addition because Python strings are immutable. The optimization collects characters in a list (`chars.append(char)`) and builds the final string once with `"".join(chars)`. In the profiled run, with 6,339 character additions, accumulation took 1.52ms with concatenation versus 1.58ms for the appends plus a single 46μs join, so this change is roughly neutral on that input. Its benefit is avoiding the repeated copying that concatenation can incur on longer words.

**Performance Impact**

These optimizations are particularly effective for OCR processing where:

- The same regex pattern is applied thousands of times per document
- Words contain multiple characters that need accumulation
- The function is likely called frequently during document processing

The 35% speedup translates directly to faster document processing in OCR workflows, with the largest gains on documents with many detected characters that pass the confidence threshold.
✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **27 Passed** |
| 🌀 Generated Regression Tests | ✅ **22 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/pdf_image/test_ocr.py::test_extract_word_from_hocr` | 63.2μs | 49.1μs | 28.7%✅ |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
```

</details>

To edit these changes `git checkout codeflash/optimize-OCRAgentTesseract.extract_word_from_hocr-mjcarjk8` and push.

[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428)](https://codeflash.ai) ![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>

References

#4130 - ⚡️ Speed up method `OCRAgentTesseract.extract_word_from_hocr` by 35%

Author

misrasaurabh1

Parents

dce53453
