enhancement: Speed up function `get_bbox_thickness` by 1,267% (#4165)
<!-- CODEFLASH_OPTIMIZATION:
{"function":"get_bbox_thickness","file":"unstructured/partition/pdf_image/analysis/bbox_visualisation.py","speedup_pct":"1,267%","speedup_x":"12.67x","original_runtime":"5.01 milliseconds","best_runtime":"367 microseconds","optimization_type":"general","timestamp":"2025-12-20T01:04:43.833Z","version":"1.0"}
-->
#### 📄 1,267% (12.67x) speedup for ***`get_bbox_thickness` in `unstructured/partition/pdf_image/analysis/bbox_visualisation.py`***
⏱️ Runtime : **`5.01 milliseconds`** **→** **`367 microseconds`** (best
of `250` runs)
#### 📝 Explanation and details
The optimization replaces `np.polyfit` with direct linear interpolation,
achieving a **12.67x speedup** by eliminating unnecessary computational
overhead.
**Key Optimization:**
- **Removed `np.polyfit`**: The original code used NumPy's polynomial
fitting for a simple linear interpolation between two points, which is
computationally expensive
- **Direct linear interpolation**: Replaced with manual slope
calculation: `slope = (max_value - min_value) / (ratio_for_max_value -
ratio_for_min_value)`
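
The interpolation described above can be sketched as follows. This is a minimal illustration, not the code from this change: the helper names `line_through_two_points` and `interpolate` and the sample values are hypothetical.

```python
def line_through_two_points(x0, y0, x1, y1):
    """Return (slope, intercept) of the line through (x0, y0) and (x1, y1).

    Hypothetical helper: with exactly two points, a degree-1 least-squares
    fit reduces to this closed form, so no fitting routine is needed.
    """
    slope = (y1 - y0) / (x1 - x0)  # raises ZeroDivisionError if x0 == x1
    intercept = y0 - slope * x0
    return slope, intercept


def interpolate(x, x0, y0, x1, y1):
    """Evaluate the line through the two points at x (no clamping)."""
    slope, intercept = line_through_two_points(x0, y0, x1, y1)
    return slope * x + intercept


# Illustrative values only: map ratio 0.0 -> 1.0 and ratio 1.0 -> 4.0.
print(line_through_two_points(0.0, 1.0, 1.0, 4.0))  # (3.0, 1.0)
print(interpolate(0.5, 0.0, 1.0, 1.0, 4.0))  # 2.5
```

The closed form needs one subtraction pair, one division, and one multiply-subtract, which is why it avoids the array setup and solver cost of a general fit.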
**Why This is Faster:**
- `np.polyfit` performs general polynomial regression using least
squares, involving matrix operations and an SVD - overkill
for two points
- Direct slope calculation requires only basic arithmetic operations
(subtraction and division)
- Line profiler shows the `np.polyfit` line consumed 91.7% of execution
time (10.67ms out of 11.64ms total)
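
One way to reproduce this kind of comparison locally is a small `timeit` harness. The sketch below is an assumption-laden illustration, not the profiler run quoted above: `direct_fit`, `time_call`, and the sample points are hypothetical, and the NumPy comparison is skipped if NumPy is not installed. Absolute timings will differ by machine.

```python
import timeit


def direct_fit(x0, y0, x1, y1):
    """Hypothetical helper: slope and intercept from two points."""
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0


def time_call(fn, number=2_000):
    """Average seconds per call over `number` invocations."""
    return timeit.timeit(fn, number=number) / number


# Illustrative sample points, not values taken from the library.
xs, ys = [0.0, 1.0], [1.0, 4.0]

direct_seconds = time_call(lambda: direct_fit(xs[0], ys[0], xs[1], ys[1]))
print(f"direct arithmetic: {direct_seconds * 1e6:.2f} us per call")

try:
    import numpy as np
except ImportError:  # NumPy is optional for this sketch
    np = None

if np is not None:
    polyfit_seconds = time_call(lambda: np.polyfit(xs, ys, 1))
    print(f"np.polyfit:        {polyfit_seconds * 1e6:.2f} us per call")
    # np.polyfit returns coefficients highest power first: [slope, intercept].
    assert np.allclose(np.polyfit(xs, ys, 1), direct_fit(0.0, 1.0, 1.0, 4.0))
```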
**Performance Impact:**
The function is called from `draw_bbox_on_image`, which processes
bounding boxes for PDF image visualization. Since this appears to be in
a rendering pipeline that could process many bounding boxes per page,
the speedup should noticeably reduce visualization time. The generated
tests show roughly 11.6x to 15.5x improvements on every call that
returns a thickness, from single bbox calls (~25μs → ~2μs) to batch
processing of 100 random bboxes (1.6ms → 116μs). The zero-page-size
error path is essentially unchanged (4-7% faster).
**Optimization Benefits:**
- **Small bboxes**: 1329% faster (basic cases)
- **Large bboxes**: 1283% faster
- **Batch processing**: 1297% faster for 100 random bboxes
- **Looped calls on a large page**: 1341% faster for a loop of 10
bboxes on a 10000x10000 page
This optimization is particularly valuable for PDF processing workflows
where many bounding boxes need thickness calculations for visualization.
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **8 Passed** |
| 🌀 Generated Regression Tests | ✅ **285 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:----------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/pdf_image/test_analysis.py::test_get_bbox_thickness` | 75.5μs | 5.58μs | 1252%✅ |
</details>
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
# imports
import pytest  # used for our unit tests

from unstructured.partition.pdf_image.analysis.bbox_visualisation import get_bbox_thickness

# unit tests

# ---------- BASIC TEST CASES ----------


def test_basic_small_bbox_returns_min_thickness():
    # Small bbox on a normal page should return min_thickness
    bbox = (10, 10, 20, 20)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 30.4μs -> 2.12μs (1329% faster)


def test_basic_large_bbox_returns_max_thickness():
    # Large bbox close to page size should return max_thickness
    bbox = (0, 0, 950, 950)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 27.1μs -> 1.96μs (1283% faster)


def test_basic_medium_bbox_returns_intermediate_thickness():
    # Medium bbox should return a value between min and max
    bbox = (100, 100, 500, 500)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.4μs -> 1.88μs (1256% faster)


def test_basic_custom_min_max_thickness():
    # Test with custom min and max thickness
    bbox = (0, 0, 500, 500)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=2, max_thickness=8)
    result = codeflash_output  # 25.5μs -> 2.00μs (1175% faster)


# ---------- EDGE TEST CASES ----------


def test_zero_area_bbox():
    # Bbox with zero area (x1==x2 and y1==y2) should return min_thickness
    bbox = (100, 100, 100, 100)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.2μs -> 1.92μs (1214% faster)


def test_bbox_exceeds_page_size():
    # Bbox larger than page should still clamp to max_thickness
    bbox = (-100, -100, 1200, 1200)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.0μs -> 1.83μs (1264% faster)


def test_negative_coordinates_bbox():
    # Bbox with negative coordinates should still work
    bbox = (-10, -10, 20, 20)
    page_size = (100, 100)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.0μs -> 1.92μs (1205% faster)


def test_min_equals_max_thickness():
    # If min_thickness == max_thickness, always return that value
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=3, max_thickness=3)
    result = codeflash_output  # 24.9μs -> 2.04μs (1119% faster)


def test_page_size_zero_raises():
    # Page size of zero should raise ZeroDivisionError
    bbox = (0, 0, 10, 10)
    page_size = (0, 0)
    with pytest.raises(ZeroDivisionError):
        get_bbox_thickness(bbox, page_size)  # 1.96μs -> 1.88μs (4.43% faster)


def test_bbox_on_line():
    # Bbox that's a line (x1==x2 or y1==y2) should return min_thickness
    bbox = (10, 10, 10, 100)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 25.4μs -> 2.04μs (1143% faster)


def test_min_thickness_greater_than_max_thickness():
    # If min_thickness > max_thickness, function should clamp to min_thickness
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=5, max_thickness=2)
    result = codeflash_output  # 24.9μs -> 2.00μs (1146% faster)


# ---------- LARGE SCALE TEST CASES ----------


def test_many_bboxes_scaling():
    # Test bboxes of increasing size, from 1 up to 901 in steps of 100
    page_size = (1000, 1000)
    min_thickness, max_thickness = 1, 8
    for i in range(1, 1001, 100):  # 10 steps to keep runtime reasonable
        bbox = (0, 0, i, i)
        codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness, max_thickness)
        result = codeflash_output  # 181μs -> 12.9μs (1307% faster)


def test_large_page_and_bbox():
    # Test with large page and bbox values
    bbox = (0, 0, 999_999, 999_999)
    page_size = (1_000_000, 1_000_000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 24.2μs -> 2.08μs (1064% faster)


def test_randomized_bboxes():
    # Test with random bboxes within a page, ensure all results in bounds
    import random

    page_size = (1000, 1000)
    min_thickness, max_thickness = 1, 4
    for _ in range(100):
        x1 = random.randint(0, 900)
        y1 = random.randint(0, 900)
        x2 = random.randint(x1, 1000)
        y2 = random.randint(y1, 1000)
        bbox = (x1, y1, x2, y2)
        codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness, max_thickness)
        result = codeflash_output  # 1.64ms -> 117μs (1297% faster)


def test_performance_large_number_of_calls():
    # Ensure function does not degrade with many calls (not a timing test, just functional)
    page_size = (500, 500)
    for i in range(1, 1001, 100):  # 10 steps
        bbox = (0, 0, i, i)
        codeflash_output = get_bbox_thickness(bbox, page_size)
        result = codeflash_output  # 173μs -> 12.7μs (1264% faster)


# ---------- ADDITIONAL EDGE CASES ----------


def test_bbox_with_float_coordinates():
    # Float coordinates are cast to int before the call, since the function expects ints
    bbox = (0.0, 0.0, 500.0, 500.0)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(tuple(map(int, bbox)), page_size)
    result = codeflash_output  # 24.0μs -> 1.88μs (1178% faster)


def test_bbox_equal_to_page():
    # Bbox exactly same as page should return max_thickness
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 23.8μs -> 1.83μs (1200% faster)


def test_bbox_minimal_size():
    # Bbox of size 1x1 should return min_thickness
    bbox = (10, 10, 11, 11)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    result = codeflash_output  # 23.9μs -> 1.88μs (1176% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
# imports
import pytest  # used for our unit tests

from unstructured.partition.pdf_image.analysis.bbox_visualisation import get_bbox_thickness

# unit tests

# ---------------------- BASIC TEST CASES ----------------------


def test_basic_small_bbox_min_thickness():
    # Very small bbox compared to page, should get min_thickness
    bbox = (10, 10, 20, 20)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 24.1μs -> 1.88μs (1184% faster)


def test_basic_large_bbox_max_thickness():
    # Very large bbox, nearly the page size, should get max_thickness
    bbox = (0, 0, 900, 900)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 23.9μs -> 1.79μs (1235% faster)


def test_basic_middle_bbox():
    # Bbox size between min and max, should interpolate
    bbox = (100, 100, 500, 500)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 23.9μs -> 1.83μs (1205% faster)


def test_basic_non_square_bbox():
    # Non-square bbox, checks diagonal calculation
    bbox = (10, 10, 110, 410)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 24.0μs -> 1.83μs (1207% faster)


def test_basic_custom_thickness_range():
    # Custom min/max thickness values
    bbox = (0, 0, 500, 500)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(
        bbox, page_size, min_thickness=2, max_thickness=8
    )  # 24.0μs -> 1.92μs (1155% faster)


# ---------------------- EDGE TEST CASES ----------------------


def test_edge_bbox_zero_size():
    # Zero-area bbox, should always return min_thickness
    bbox = (100, 100, 100, 100)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 24.0μs -> 1.83μs (1209% faster)


def test_edge_bbox_full_page():
    # Bbox covers the whole page, should return max_thickness
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 23.9μs -> 1.83μs (1205% faster)


def test_edge_bbox_negative_coordinates():
    # Bbox with negative coordinates, still valid diagonal
    bbox = (-50, -50, 50, 50)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 23.9μs -> 1.83μs (1203% faster)


def test_edge_bbox_larger_than_page():
    # Bbox larger than page, should clamp to max_thickness
    bbox = (-100, -100, 1200, 1200)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 23.8μs -> 1.79μs (1228% faster)


def test_edge_min_greater_than_max():
    # min_thickness > max_thickness, should always return min_thickness (clamped)
    bbox = (0, 0, 1000, 1000)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(
        bbox, page_size, min_thickness=5, max_thickness=2
    )  # 24.1μs -> 1.92μs (1156% faster)


def test_edge_zero_page_size():
    # Page size zero, should raise ZeroDivisionError
    bbox = (0, 0, 10, 10)
    page_size = (0, 0)
    with pytest.raises(ZeroDivisionError):
        get_bbox_thickness(bbox, page_size)  # 1.88μs -> 1.75μs (7.14% faster)


def test_edge_bbox_on_page_border():
    # Bbox on the edge of the page, not exceeding bounds
    bbox = (0, 0, 1000, 10)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 24.8μs -> 2.00μs (1138% faster)


def test_edge_non_integer_bbox_and_page():
    # Bbox and page_size with float values, should still work
    bbox = (0.0, 0.0, 500.5, 500.5)
    page_size = (1000.0, 1000.0)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 23.9μs -> 1.54μs (1448% faster)


def test_edge_bbox_swapped_coordinates():
    # Bbox with x2 < x1 or y2 < y1, negative width/height
    bbox = (100, 100, 50, 50)
    page_size = (1000, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)
    thickness = codeflash_output  # 24.3μs -> 1.96μs (1143% faster)


# ---------------------- LARGE SCALE TEST CASES ----------------------


def test_large_scale_many_bboxes():
    # Test many bboxes on a large page
    page_size = (10000, 10000)
    for i in range(1, 1001, 100):  # 10 iterations, up to 1000
        bbox = (i, i, i + 100, i + 100)
        codeflash_output = get_bbox_thickness(bbox, page_size)
        thickness = codeflash_output  # 177μs -> 12.3μs (1341% faster)


def test_large_scale_increasing_bbox_size():
    # Test increasing bbox sizes from tiny to almost page size
    page_size = (1000, 1000)
    for size in range(1, 1001, 100):
        bbox = (0, 0, size, size)
        codeflash_output = get_bbox_thickness(bbox, page_size)
        thickness = codeflash_output  # 173μs -> 12.7μs (1263% faster)
        # Should be monotonic non-decreasing
        if size > 1:
            codeflash_output = get_bbox_thickness((0, 0, size - 100, size - 100), page_size)
            prev_thickness = codeflash_output


def test_large_scale_random_bboxes():
    # Generate 100 random bboxes and check thickness is in range
    import random

    page_size = (1000, 1000)
    for _ in range(100):
        x1 = random.randint(0, 900)
        y1 = random.randint(0, 900)
        x2 = random.randint(x1, 1000)
        y2 = random.randint(y1, 1000)
        bbox = (x1, y1, x2, y2)
        codeflash_output = get_bbox_thickness(bbox, page_size)
        thickness = codeflash_output  # 1.63ms -> 116μs (1296% faster)


def test_large_scale_extreme_aspect_ratios():
    # Very thin or very flat bboxes
    page_size = (1000, 1000)
    # Very thin vertical
    bbox = (500, 0, 501, 1000)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 23.8μs -> 1.88μs (1167% faster)
    # Very thin horizontal
    bbox = (0, 500, 1000, 501)
    codeflash_output = get_bbox_thickness(bbox, page_size)  # 18.3μs -> 1.38μs (1230% faster)


def test_large_scale_varied_thickness_range():
    # Test with large min/max thickness range
    page_size = (1000, 1000)
    for size in range(1, 1001, 200):
        bbox = (0, 0, size, size)
        codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=10, max_thickness=100)
        thickness = codeflash_output  # 93.3μs -> 7.17μs (1202% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
To edit these changes, run `git checkout codeflash/optimize-get_bbox_thickness-mjdlipbj` and push.

---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>