enhancement: Speed up method `_DocxPartitioner._parse_category_depth_by_style_name` by 69% (#4174)
<!-- CODEFLASH_OPTIMIZATION:
{"function":"_DocxPartitioner._parse_category_depth_by_style_name","file":"unstructured/partition/docx.py","speedup_pct":"69%","speedup_x":"0.69x","original_runtime":"8.62 milliseconds","best_runtime":"5.11 milliseconds","optimization_type":"loop","timestamp":"2025-08-22T21:02:58.781Z","version":"1.0"}
-->
### 📄 69% (0.69x) speedup for ***`_DocxPartitioner._parse_category_depth_by_style_name` in `unstructured/partition/docx.py`***
⏱️ Runtime : **`8.62 milliseconds`** **→** **`5.11 milliseconds`** (best
of `17` runs)
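The headline percentage can be reproduced from the two runtimes. The sketch below assumes the speedup is defined as the time saved relative to the optimized runtime; that definition is an inference from the reported numbers rather than something stated here, and `speedup_pct` is a hypothetical helper name.

```python
def speedup_pct(original_ms: float, optimized_ms: float) -> float:
    """Return the speedup as a percentage of the optimized runtime.

    Assumed definition: (original - optimized) / optimized * 100.
    """
    return (original_ms - optimized_ms) / optimized_ms * 100


if __name__ == "__main__":
    # Runtimes reported above: 8.62 ms before, 5.11 ms after.
    print(f"{speedup_pct(8.62, 5.11):.0f}%")  # prints "69%"
```

Under this definition a figure above 100% (as in some of the per-test results) means the optimized code runs in less than half the original time.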
### 📝 Explanation and details
The optimized code achieves a **69% speedup** through two key
optimizations:
**1. Tuple-based prefix matching:** Changed `list_prefixes` from a list
to a tuple and replaced the `any()` loop with a single
`str.startswith()` call that accepts multiple prefixes. This eliminates
the overhead of creating a generator expression and iterating through
prefixes one by one. The line profiler shows this optimization reduced
the time spent on prefix matching from 39.4% to 10.9% of total execution
time.
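A minimal sketch of the before and after forms of the prefix check. The prefix values are assumed from the style names exercised in the tests in this description, and the function names are hypothetical; this is not the code from `docx.py`.

```python
# Assumed prefix set; the real values live in unstructured/partition/docx.py.
LIST_PREFIXES_AS_LIST = ["List", "List Bullet", "List Continue", "List Number"]
LIST_PREFIXES_AS_TUPLE = tuple(LIST_PREFIXES_AS_LIST)


def is_list_style_any(style_name: str) -> bool:
    # Original form: a generator expression plus one startswith() call per prefix.
    return any(style_name.startswith(prefix) for prefix in LIST_PREFIXES_AS_LIST)


def is_list_style_tuple(style_name: str) -> bool:
    # Optimized form: str.startswith() accepts a tuple and tests every prefix
    # in a single call. It requires a tuple; passing a list raises TypeError.
    return style_name.startswith(LIST_PREFIXES_AS_TUPLE)
```

Both forms return the same result for any input string; only the per-call overhead differs.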
**2. Cached string splitting in `_extract_number()`:** Instead of
calling `suffix.split()` twice (once to check the last element and once
to extract it), the result is now cached in a `parts` variable. This
eliminates redundant string operations when extracting numbers from
style names.
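The cached split can be sketched as follows. The "before" body is a reconstruction from the description above (two `split()` calls), not a copy of the source, and both function names are hypothetical.

```python
def extract_number_uncached(suffix: str) -> int:
    # Reconstructed original shape: suffix.split() is evaluated twice.
    return int(suffix.split()[-1]) - 1 if suffix.split()[-1].isdigit() else 0


def extract_number_cached(suffix: str) -> int:
    # Optimized shape: split once and reuse the result.
    parts = suffix.split()
    last = parts[-1]
    return int(last) - 1 if last.isdigit() else 0
```

Both versions assume a non-empty style name, which holds when the helper is only reached after a successful prefix match. A trailing number `N` maps to depth `N - 1`, and anything else maps to `0`.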
**Performance characteristics by test case:**
- **List styles see large gains** (typically 43-69% faster): The tuple-based
prefix matching is most effective here since these styles require prefix
checking
- **Non-matching styles improve the most** (65-151% faster): These
benefit from faster rejection through the optimized prefix check
- **Heading styles show modest gains** (up to 33% faster, with a few cases
measuring 2-3% slower): These bypass the list prefix logic, so improvements
come mainly from the cached splitting
- **Large-scale tests demonstrate consistent speedup** (20-69% faster):
The optimizations scale well with volume
The optimizations are particularly effective for documents with many
list-style elements or diverse style names that don't match any
prefixes.
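One way to check the per-category behaviour locally is a small `timeit` comparison of the two prefix checks. This is a rough sketch only: the prefix set is assumed from the style names in the tests, the helper names are hypothetical, and absolute timings will differ by machine and Python version.

```python
from __future__ import annotations

import timeit

# Assumed prefix set, for illustration only.
PREFIX_LIST = ["List", "List Bullet", "List Continue", "List Number"]
PREFIX_TUPLE = tuple(PREFIX_LIST)


def match_any(name: str) -> bool:
    # Loop-based check: one startswith() call per prefix.
    return any(name.startswith(prefix) for prefix in PREFIX_LIST)


def match_tuple(name: str) -> bool:
    # Single startswith() call with a tuple of prefixes.
    return name.startswith(PREFIX_TUPLE)


def compare(name: str, number: int = 100_000) -> tuple[float, float]:
    """Return total seconds for `number` calls of each variant."""
    t_any = timeit.timeit(lambda: match_any(name), number=number)
    t_tuple = timeit.timeit(lambda: match_tuple(name), number=number)
    return t_any, t_tuple


if __name__ == "__main__":
    # One list style and one non-matching style, mirroring the categories above.
    for name in ("List Bullet 3", "Custom Style 7"):
        t_any, t_tuple = compare(name)
        print(f"{name!r}: any()={t_any:.4f}s tuple={t_tuple:.4f}s")
```

A non-matching name forces the `any()` variant to try every prefix before returning `False`, which is consistent with those styles showing the largest relative improvement.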
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **87 Passed** |
| 🌀 Generated Regression Tests | ✅ **5555 Passed** |
| ⏪ Replay Tests | ✅ **13 Passed** |
| 🔎 Concolic Coverage Tests | ✅ **6 Passed** |
|📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:------------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_docx.py::test_parse_category_depth_by_style_name` | 24.5μs | 17.3μs | 41.7%✅ |
</details>
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from __future__ import annotations

# imports
import pytest

from unstructured.partition.docx import _DocxPartitioner

# unit tests
@pytest.fixture
def partitioner():
    # Provide a partitioner instance for use in tests
    return _DocxPartitioner()

# --------------------------
# 1. Basic Test Cases
# --------------------------
def test_heading_level_1(partitioner):
    # Heading 1 should map to depth 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1")

def test_heading_level_2(partitioner):
    # Heading 2 should map to depth 1
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2")

def test_heading_level_10(partitioner):
    # Heading 10 should map to depth 9
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 10")

def test_subtitle(partitioner):
    # Subtitle should map to depth 1
    codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle")

def test_list_bullet_1(partitioner):
    # List Bullet 1 should map to depth 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 1")

def test_list_bullet_3(partitioner):
    # List Bullet 3 should map to depth 2
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 3")

def test_list_number_2(partitioner):
    # List Number 2 should map to depth 1
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 2")

def test_list_continue_5(partitioner):
    # List Continue 5 should map to depth 4
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 5")

def test_list_plain(partitioner):
    # "List" without a number should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List")

def test_normal_style(partitioner):
    # Any non-special style should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("Normal")

def test_random_style(partitioner):
    # Unknown style name should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("RandomStyle")
# --------------------------
# 2. Edge Test Cases
# --------------------------
def test_heading_with_extra_spaces(partitioner):
    # Heading with extra spaces should still parse the last word as number if possible
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 3")

def test_heading_without_number(partitioner):
    # Heading with no number should map to 0 (since no number to subtract 1)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading")

def test_list_bullet_with_non_digit_suffix(partitioner):
    # List Bullet with non-digit at end should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet foo")

def test_list_number_with_large_number(partitioner):
    # List Number with a large number
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 999")

def test_empty_string(partitioner):
    # Empty string should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("")

def test_case_sensitivity(partitioner):
    # Should be case-sensitive: "heading 1" does not match "Heading"
    codeflash_output = partitioner._parse_category_depth_by_style_name("heading 1")

def test_subtitle_case(partitioner):
    # "subtitle" (lowercase) should not match "Subtitle"
    codeflash_output = partitioner._parse_category_depth_by_style_name("subtitle")

def test_list_bullet_with_multiple_spaces(partitioner):
    # List Bullet with multiple spaces before number
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 2")

def test_style_name_with_trailing_space(partitioner):
    # Style name with trailing space
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 4 ")

def test_style_name_with_leading_space(partitioner):
    # Style name with leading space
    codeflash_output = partitioner._parse_category_depth_by_style_name(" List Bullet 2")

def test_style_name_with_internal_non_digit(partitioner):
    # Heading with non-digit in the number position
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading X")

def test_style_name_with_number_in_middle(partitioner):
    # Only the last word is checked for a digit
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2 Extra")

def test_list_continue_with_no_number(partitioner):
    # List Continue with no number should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue")

def test_style_name_with_special_characters(partitioner):
    # Style name with special characters should not break function
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading #$%")

def test_list_prefix_overlap(partitioner):
    # "List BulletPoint 2" does not match any valid prefix, so should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List BulletPoint 2")
# --------------------------
# 3. Large Scale Test Cases
# --------------------------
def test_many_headings(partitioner):
    # Test a large number of headings, up to 1000
    for i in range(1, 1001):
        # "Heading N" should map to N-1
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}")

def test_many_list_bullets(partitioner):
    # Test a large number of list bullets, up to 1000
    for i in range(1, 1001):
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}")

def test_many_list_numbers(partitioner):
    # Test a large number of list numbers, up to 1000
    for i in range(1, 1001):
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}")

def test_mixed_styles_large_scale(partitioner):
    # Mix a large number of different style names, including edge cases
    for i in range(1, 501):
        # Headings
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}")
        # List Bullets
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}")
        # List Number
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}")
        # List Continue
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Continue {i}")
        # Unknown style
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Unknown {i}")
        # Heading with non-digit
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}x")

def test_large_scale_with_unusual_inputs(partitioner):
    # Test 1000 random/edge case style names
    for i in range(1, 1001):
        # Style with only number
        codeflash_output = partitioner._parse_category_depth_by_style_name(str(i))
        # Style with number at start
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"{i} Heading")
        # Style with number in middle
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List {i} Bullet")
        # Style with extra spaces
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}")

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

# imports
import pytest # used for our unit tests

from unstructured.partition.docx import _DocxPartitioner

# function to test
# pyright: reportPrivateUsage=false
class DocxPartitionerOptions:
    pass

# unit tests
@pytest.fixture
def partitioner():
    # Fixture to create a _DocxPartitioner instance
    return _DocxPartitioner(DocxPartitionerOptions())

# 1. Basic Test Cases
def test_heading_styles_basic(partitioner):
    # Test standard heading styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1") # 4.50μs -> 4.41μs (2.16% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2") # 1.45μs -> 1.41μs (2.62% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 3") # 1.23μs -> 1.07μs (14.6% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 10") # 1.45μs -> 1.09μs (33.2% faster)

def test_subtitle_style(partitioner):
    # Test the special case for 'Subtitle'
    codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle") # 1.97μs -> 1.89μs (4.18% faster)

def test_list_styles_basic(partitioner):
    # Test basic list styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 1") # 6.28μs -> 4.37μs (43.6% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 2") # 2.53μs -> 1.59μs (59.8% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 10") # 2.36μs -> 1.47μs (60.8% faster)

def test_list_bullet_styles(partitioner):
    # Test 'List Bullet' styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 1") # 6.13μs -> 4.47μs (37.1% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 2") # 2.53μs -> 1.71μs (47.9% faster)

def test_list_continue_styles(partitioner):
    # Test 'List Continue' styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 1") # 6.34μs -> 4.41μs (43.9% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 5") # 2.59μs -> 1.75μs (48.1% faster)

def test_list_number_styles(partitioner):
    # Test 'List Number' styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 1") # 6.25μs -> 4.34μs (44.0% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 3") # 2.54μs -> 1.72μs (48.2% faster)

def test_other_styles_default_to_zero(partitioner):
    # Test styles that should default to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("Normal") # 4.09μs -> 2.48μs (65.0% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Body Text") # 1.94μs -> 913ns (113% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Title") # 1.65μs -> 728ns (127% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Random Style") # 1.58μs -> 727ns (117% faster)
# 2. Edge Test Cases
def test_heading_without_number(partitioner):
    # Test 'Heading' with no number
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading") # 3.04μs -> 3.10μs (1.97% slower)

def test_list_without_number(partitioner):
    # Test 'List' with no number
    codeflash_output = partitioner._parse_category_depth_by_style_name("List") # 5.24μs -> 3.63μs (44.4% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet") # 2.56μs -> 1.72μs (49.1% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue") # 1.66μs -> 1.08μs (53.4% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number") # 1.52μs -> 932ns (63.2% faster)

def test_heading_with_non_numeric_suffix(partitioner):
    # Test 'Heading' with a non-numeric suffix
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading One") # 3.37μs -> 3.44μs (1.95% slower)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading X") # 1.36μs -> 1.32μs (2.81% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1A") # 953ns -> 983ns (3.05% slower)

def test_list_with_non_numeric_suffix(partitioner):
    # Test 'List' with a non-numeric suffix
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet X") # 5.65μs -> 3.98μs (42.2% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue A") # 2.24μs -> 1.61μs (38.9% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number Foo") # 1.76μs -> 1.22μs (44.1% faster)

def test_case_sensitivity(partitioner):
    # Test that style names are case-sensitive
    codeflash_output = partitioner._parse_category_depth_by_style_name("heading 1") # 3.98μs -> 2.35μs (69.0% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("HEADING 1") # 2.01μs -> 935ns (115% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle") # 665ns -> 591ns (12.5% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("subtitle") # 1.71μs -> 803ns (113% faster)

def test_empty_and_whitespace_styles(partitioner):
    # Test empty string and whitespace-only style names
    codeflash_output = partitioner._parse_category_depth_by_style_name("") # 4.14μs -> 2.40μs (72.4% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name(" ") # 1.94μs -> 808ns (139% faster)

def test_style_name_with_extra_spaces(partitioner):
    # Test style names with extra spaces
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2") # 3.79μs -> 3.78μs (0.371% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 3") # 4.11μs -> 2.14μs (92.0% faster)

def test_style_name_with_leading_trailing_spaces(partitioner):
    # Test style names with leading/trailing spaces
    codeflash_output = partitioner._parse_category_depth_by_style_name(" Heading 1") # 3.97μs -> 2.44μs (62.9% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 2 ") # 4.34μs -> 3.08μs (41.2% faster)

def test_style_name_with_multiple_words(partitioner):
    # Test style names with multiple words that don't match any prefix
    codeflash_output = partitioner._parse_category_depth_by_style_name("My Custom Heading 1") # 3.81μs -> 2.31μs (65.1% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet Special 2") # 4.46μs -> 3.08μs (45.0% faster)

def test_style_name_with_large_number(partitioner):
    # Test styles with very large numbers
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 999") # 4.21μs -> 4.06μs (3.64% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 1000") # 4.09μs -> 2.19μs (86.9% faster)
# 3. Large Scale Test Cases
def test_large_number_of_headings(partitioner):
    # Test a large number of heading levels for performance and correctness
    for i in range(1, 1000):
        style = f"Heading {i}"
        expected = i - 1
        codeflash_output = partitioner._parse_category_depth_by_style_name(style) # 1.06ms -> 881μs (20.5% faster)

def test_large_number_of_list_bullets(partitioner):
    # Test a large number of list bullet levels
    for i in range(1, 1000):
        style = f"List Bullet {i}"
        expected = i - 1
        codeflash_output = partitioner._parse_category_depth_by_style_name(style) # 1.77ms -> 1.05ms (67.8% faster)

def test_large_number_of_list_numbers(partitioner):
    # Test a large number of list number levels
    for i in range(1, 1000):
        style = f"List Number {i}"
        expected = i - 1
        codeflash_output = partitioner._parse_category_depth_by_style_name(style) # 1.77ms -> 1.05ms (69.0% faster)

def test_large_number_of_non_matching_styles(partitioner):
    # Test a large number of non-matching style names
    for i in range(1, 1000):
        style = f"Custom Style {i}"
        codeflash_output = partitioner._parse_category_depth_by_style_name(style) # 1.47ms -> 583μs (151% faster)

def test_large_mixed_styles(partitioner):
    # Test a mixture of all types in a large batch
    for i in range(1, 250):
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}") # 282μs -> 229μs (23.2% faster)
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List {i}")
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}") # 434μs -> 265μs (63.7% faster)
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Continue {i}")
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}") # 431μs -> 258μs (66.5% faster)
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Random Style {i}")

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import TextIO

from unstructured.partition.docx import DocxPartitionerOptions
from unstructured.partition.docx import _DocxPartitioner

def test__DocxPartitioner__parse_category_depth_by_style_name():
    _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=None, file_path='', include_page_breaks=False, infer_table_structure=True, starting_page_number=0, strategy=None)), 'List\x00\x00\x00\x00')

def test__DocxPartitioner__parse_category_depth_by_style_name_2():
    _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=None, file_path=None, include_page_breaks=False, infer_table_structure=False, starting_page_number=0, strategy=None)), '')

def test__DocxPartitioner__parse_category_depth_by_style_name_3():
    _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=TextIO(), file_path='', include_page_breaks=True, infer_table_structure=False, starting_page_number=0, strategy='')), 'Subtitle')
```
</details>
<details>
<summary>⏪ Replay Tests and Runtime</summary>
</details>
<details>
<summary>🔎 Concolic Coverage Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name` | 6.64μs | 4.95μs | 34.2%✅ |
| `codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name_2` | 4.50μs | 2.90μs | 54.8%✅ |
| `codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name_3` | 1.80μs | 1.70μs | 5.58%✅ |
</details>
To edit these changes, run `git checkout codeflash/optimize-_DocxPartitioner._parse_category_depth_by_style_name-menbhfu6`, commit your edits, and push.
---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>