unstructured
a5e206f8 - enhancement: Speed up function `get_bbox_thickness` by 1,267% (#4165)

Commit

13 days ago

enhancement: Speed up function `get_bbox_thickness` by 1,267% (#4165)

#### 📄 1,267% (12.67x) speedup for ***`get_bbox_thickness` in `unstructured/partition/pdf_image/analysis/bbox_visualisation.py`***

⏱️ Runtime : **`5.01 milliseconds`** **→** **`367 microseconds`** (best of `250` runs)

#### 📝 Explanation and details

The optimization replaces `np.polyfit` with direct linear interpolation, achieving a **13x speedup** by eliminating unnecessary computational overhead.

**Key Optimization:**
- **Removed `np.polyfit`**: The original code used NumPy's polynomial fitting for a simple linear interpolation between two points, which is computationally expensive
- **Direct linear interpolation**: Replaced with manual slope calculation: `slope = (max_value - min_value) / (ratio_for_max_value - ratio_for_min_value)`

**Why This is Faster:**
- `np.polyfit` performs general polynomial regression using least squares, involving matrix operations and SVD decomposition - overkill for two points
- Direct slope calculation requires only basic arithmetic operations (subtraction and division)
- Line profiler shows the `np.polyfit` line consumed 91.7% of execution time (10.67ms out of 11.64ms total)

**Performance Impact:**
The function is called from `draw_bbox_on_image` which processes bounding boxes for PDF image visualization. Since this appears to be in a rendering pipeline that could process many bounding boxes per page, the 13x speedup significantly improves visualization performance. Test results show consistent 12-13x improvements across all scenarios, from single bbox calls (~25μs → ~2μs) to batch processing of 100 random bboxes (1.6ms → 116μs).
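As a rough illustration of the change described above, the sketch below computes a line through two points with plain arithmetic and clamps the result. It is not the code from `bbox_visualisation.py`: the function name `interpolate_thickness`, the default ratios `0.01` and `0.5`, and the clamping step are assumptions made for the example; only the `slope` formula is taken from the description.

```python
def interpolate_thickness(
    ratio,
    min_value=1,
    max_value=4,
    ratio_for_min_value=0.01,  # assumed default, for illustration only
    ratio_for_max_value=0.5,  # assumed default, for illustration only
):
    """Hypothetical helper: map a bbox-to-page ratio onto a thickness range."""
    # Slope of the line through (ratio_for_min_value, min_value) and
    # (ratio_for_max_value, max_value), as given in the description.
    slope = (max_value - min_value) / (ratio_for_max_value - ratio_for_min_value)
    # Intercept chosen so the line passes through the first point.
    intercept = min_value - slope * ratio_for_min_value
    value = slope * ratio + intercept
    # Clamp to the allowed range (assumed behaviour, not confirmed by the source).
    return max(min_value, min(max_value, value))


if __name__ == "__main__":
    for r in (0.0, 0.01, 0.255, 0.5, 2.0):
        print(r, interpolate_thickness(r))
```

With only two points, a degree-1 least-squares fit passes through both of them exactly, so this arithmetic should agree with the `np.polyfit` coefficients up to floating-point rounding.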
**Optimization Benefits:**
- **Small bboxes**: 1329% faster (basic cases)
- **Large bboxes**: 1283% faster
- **Batch processing**: 1297% faster for 100 random bboxes
- **Scale-intensive workloads**: 1341% faster for processing 1000+ bboxes

This optimization is particularly valuable for PDF processing workflows where many bounding boxes need thickness calculations for visualization.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **8 Passed** |
| 🌀 Generated Regression Tests | ✅ **285 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:----------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/pdf_image/test_analysis.py::test_get_bbox_thickness` | 75.5μs | 5.58μs | 1252%✅ |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
# imports
import pytest  # used for our unit tests
from unstructured.partition.pdf_image.analysis.bbox_visualisation import get_bbox_thickness

# unit tests

# ---------- BASIC TEST CASES ----------


def test_basic_small_bbox_returns_min_thickness():
    # Small bbox on a normal page should return min_thickness
    bbox = (10, 10, 20, 20)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 30.4μs -> 2.12μs (1329% faster)


def test_basic_large_bbox_returns_max_thickness():
    # Large bbox close to page size should return max_thickness
    bbox = (0, 0, 950, 950)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 27.1μs -> 1.96μs (1283% faster)


def test_basic_medium_bbox_returns_intermediate_thickness():
    # Medium bbox should return a value between min and max
    bbox = (100, 100, 500, 500)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.4μs -> 1.88μs (1256% faster)


def test_basic_custom_min_max_thickness():
    # Test with custom min and max thickness
    bbox = (0, 0, 500, 500)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=2, max_thickness=8)
    result = codeflash_output  # 25.5μs -> 2.00μs (1175% faster)


# ---------- EDGE TEST CASES ----------


def test_zero_area_bbox():
    # Bbox with zero area (x1==x2 and y1==y2) should return min_thickness
    bbox = (100, 100, 100, 100)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.2μs -> 1.92μs (1214% faster)


def test_bbox_exceeds_page_size():
    # Bbox larger than page should still clamp to max_thickness
    bbox = (-100, -100, 1200, 1200)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.0μs -> 1.83μs (1264% faster)


def test_negative_coordinates_bbox():
    # Bbox with negative coordinates should still work
    bbox = (-10, -10, 20, 20)
    page_size = (100, 100)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.0μs -> 1.92μs (1205% faster)


def test_min_equals_max_thickness():
    # If min_thickness == max_thickness, always return that value
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=3, max_thickness=3)
    result = codeflash_output  # 24.9μs -> 2.04μs (1119% faster)


def test_page_size_zero_raises():
    # Page size of zero should raise ZeroDivisionError
    bbox = (0, 0, 10, 10)
    page_size = (0, 0)
    with pytest.raises(ZeroDivisionError):
        get_bbox_thickness(bbox, page_size)  # 1.96μs -> 1.88μs (4.43% faster)


def test_bbox_on_line():
    # Bbox that's a line (x1==x2 or y1==y2) should return min_thickness
    bbox = (10, 10, 10, 100)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.4μs -> 2.04μs (1143% faster)


def test_min_thickness_greater_than_max_thickness():
    # If min_thickness > max_thickness, function should clamp to min_thickness
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=5, max_thickness=2)
    result = codeflash_output  # 24.9μs -> 2.00μs (1146% faster)


# ---------- LARGE SCALE TEST CASES ----------


def test_many_bboxes_scaling():
    # Test with 1000 bboxes of increasing size
    page_size = (1000, 1000)
    min_thickness, max_thickness = 1, 8
    for i in range(1, 1001, 100):  # 10 steps to keep runtime reasonable
        bbox = (0, 0, i, i)
        codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness, max_thickness)
        result = codeflash_output  # 181μs -> 12.9μs (1307% faster)


def test_large_page_and_bbox():
    # Test with large page and bbox values
    bbox = (0, 0, 999_999, 999_999)
    page_size = (1_000_000, 1_000_000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 24.2μs -> 2.08μs (1064% faster)


def test_randomized_bboxes():
    # Test with random bboxes within a page, ensure all results in bounds
    import random

    page_size = (1000, 1000)
    min_thickness, max_thickness = 1, 4
    for _ in range(100):
        x1 = random.randint(0, 900)
        y1 = random.randint(0, 900)
        x2 = random.randint(x1, 1000)
        y2 = random.randint(y1, 1000)
        bbox = (x1, y1, x2, y2)
        codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness, max_thickness)
        result = codeflash_output  # 1.64ms -> 117μs (1297% faster)


def test_performance_large_number_of_calls():
    # Ensure function does not degrade with many calls (not a timing test, just functional)
    page_size = (500, 500)
    for i in range(1, 1001, 100):  # 10 steps
        bbox = (0, 0, i, i)
        codeflash_output = get_bbox_thickness(bbox, page_size)
        result = codeflash_output  # 173μs -> 12.7μs (1264% faster)


# ---------- ADDITIONAL EDGE CASES ----------


def test_bbox_with_float_coordinates():
    # Non-integer coordinates should still work (since function expects int, but let's see)
    bbox = (0.0, 0.0, 500.0, 500.0)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(tuple(map(int, bbox)), page_size)
    result = codeflash_output  # 24.0μs -> 1.88μs (1178% faster)


def test_bbox_equal_to_page():
    # Bbox exactly same as page should return max_thickness
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 23.8μs -> 1.83μs (1200% faster)


def test_bbox_minimal_size():
    # Bbox of size 1x1 should return min_thickness
    bbox = (10, 10, 11, 11)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 23.9μs -> 1.88μs (1176% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
# imports
import pytest  # used for our unit tests
from unstructured.partition.pdf_image.analysis.bbox_visualisation import get_bbox_thickness

# unit tests

# ---------------------- BASIC TEST CASES ----------------------


def test_basic_small_bbox_min_thickness():
    # Very small bbox compared to page, should get min_thickness
    bbox = (10, 10, 20, 20)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 24.1μs -> 1.88μs (1184% faster)


def test_basic_large_bbox_max_thickness():
    # Very large bbox, nearly the page size, should get max_thickness
    bbox = (0, 0, 900, 900)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 23.9μs -> 1.79μs (1235% faster)


def test_basic_middle_bbox():
    # Bbox size between min and max, should interpolate
    bbox = (100, 100, 500, 500)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 23.9μs -> 1.83μs (1205% faster)


def test_basic_non_square_bbox():
    # Non-square bbox, checks diagonal calculation
    bbox = (10, 10, 110, 410)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 24.0μs -> 1.83μs (1207% faster)


def test_basic_custom_thickness_range():
    # Custom min/max thickness values
    bbox = (0, 0, 500, 500)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(
        bbox, page_size, min_thickness=2, max_thickness=8
    )  # 24.0μs -> 1.92μs (1155% faster)


# ---------------------- EDGE TEST CASES ----------------------


def test_edge_bbox_zero_size():
    # Zero-area bbox, should always return min_thickness
    bbox = (100, 100, 100, 100)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 24.0μs -> 1.83μs (1209% faster)


def test_edge_bbox_full_page():
    # Bbox covers the whole page, should return max_thickness
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 23.9μs -> 1.83μs (1205% faster)


def test_edge_bbox_negative_coordinates():
    # Bbox with negative coordinates, still valid diagonal
    bbox = (-50, -50, 50, 50)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 23.9μs -> 1.83μs (1203% faster)


def test_edge_bbox_larger_than_page():
    # Bbox larger than page, should clamp to max_thickness
    bbox = (-100, -100, 1200, 1200)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 23.8μs -> 1.79μs (1228% faster)


def test_edge_min_greater_than_max():
    # min_thickness > max_thickness, should always return min_thickness (clamped)
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(
        bbox, page_size, min_thickness=5, max_thickness=2
    )  # 24.1μs -> 1.92μs (1156% faster)


def test_edge_zero_page_size():
    # Page size zero, should raise ZeroDivisionError
    bbox = (0, 0, 10, 10)
    page_size = (0, 0)
    with pytest.raises(ZeroDivisionError):
        get_bbox_thickness(bbox, page_size)  # 1.88μs -> 1.75μs (7.14% faster)


def test_edge_bbox_on_page_border():
    # Bbox on the edge of the page, not exceeding bounds
    bbox = (0, 0, 1000, 10)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 24.8μs -> 2.00μs (1138% faster)


def test_edge_non_integer_bbox_and_page():
    # Bbox and page_size with float values, should still work
    bbox = (0.0, 0.0, 500.5, 500.5)
    page_size = (1000.0, 1000.0)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 23.9μs -> 1.54μs (1448% faster)


def test_edge_bbox_swapped_coordinates():
    # Bbox with x2 < x1 or y2 < y1, negative width/height
    bbox = (100, 100, 50, 50)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 24.3μs -> 1.96μs (1143% faster)


# ---------------------- LARGE SCALE TEST CASES ----------------------


def test_large_scale_many_bboxes():
    # Test many bboxes on a large page
    page_size = (10000, 10000)
    for i in range(1, 1001, 100):  # 10 iterations, up to 1000
        bbox = (i, i, i + 100, i + 100)
        codeflash_output = get_bbox_thickness(bbox, page_size)
        thickness = codeflash_output  # 177μs -> 12.3μs (1341% faster)


def test_large_scale_increasing_bbox_size():
    # Test increasing bbox sizes from tiny to almost page size
    page_size = (1000, 1000)
    for size in range(1, 1001, 100):
        bbox = (0, 0, size, size)
        codeflash_output = get_bbox_thickness(bbox, page_size)
        thickness = codeflash_output  # 173μs -> 12.7μs (1263% faster)
        # Should be monotonic non-decreasing
        if size > 1:
            codeflash_output = get_bbox_thickness((0, 0, size - 100, size - 100), page_size)
            prev_thickness = codeflash_output


def test_large_scale_random_bboxes():
    # Generate 100 random bboxes and check thickness is in range
    import random

    page_size = (1000, 1000)
    for _ in range(100):
        x1 = random.randint(0, 900)
        y1 = random.randint(0, 900)
        x2 = random.randint(x1, 1000)
        y2 = random.randint(y1, 1000)
        bbox = (x1, y1, x2, y2)
        codeflash_output = get_bbox_thickness(bbox, page_size)
        thickness = codeflash_output  # 1.63ms -> 116μs (1296% faster)


def test_large_scale_extreme_aspect_ratios():
    # Very thin or very flat bboxes
    page_size = (1000, 1000)
    # Very thin vertical
    bbox = (500, 0, 501, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 23.8μs -> 1.88μs (1167% faster)
    # Very thin horizontal
    bbox = (0, 500, 1000, 501)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 18.3μs -> 1.38μs (1230% faster)


def test_large_scale_varied_thickness_range():
    # Test with large min/max thickness range
    page_size = (1000, 1000)
    for size in range(1, 1001, 200):
        bbox = (0, 0, size, size)
        codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=10, max_thickness=100)
        thickness = codeflash_output  # 93.3μs -> 7.17μs (1202% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>

To edit these changes `git checkout codeflash/optimize-get_bbox_thickness-mjdlipbj` and push.

[Codeflash](https://codeflash.ai) ![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>

References

#4165 - ⚡️ Speed up function `get_bbox_thickness` by 1,267%

Author

misrasaurabh1

Parents

6895f118
