unstructured
1adb0b9f - enhancement: Speed up method `_DocxPartitioner._parse_category_depth_by_style_name` by 69% (#4174)

Commit

35 days ago

enhancement: Speed up method `_DocxPartitioner._parse_category_depth_by_style_name` by 69% (#4174)

### 📄 69% (0.69x) speedup for ***`_DocxPartitioner._parse_category_depth_by_style_name` in `unstructured/partition/docx.py`***

⏱️ Runtime : **`8.62 milliseconds`** **→** **`5.11 milliseconds`** (best of `17` runs)

### 📝 Explanation and details

The optimized code achieves a **69% speedup** (8.62 ms down to 5.11 ms) through two key optimizations:

**1. Tuple-based prefix matching:** Changed `list_prefixes` from a list to a tuple and replaced the `any()` loop with a single `str.startswith()` call that accepts multiple prefixes. This eliminates the overhead of creating a generator expression and iterating through the prefixes one by one. The line profiler shows this optimization reduced the time spent on prefix matching from 39.4% to 10.9% of total execution time.

**2. Cached string splitting in `_extract_number()`:** Instead of calling `suffix.split()` twice (once to check the last element and once to extract it), the result is now cached in a `parts` variable. This eliminates redundant string operations when extracting numbers from style names.

**Performance characteristics by test case:**

- **List styles see large gains** (43-69% faster): The tuple-based prefix matching is most effective here since these styles require prefix checking
- **Non-matching styles improve the most** (65-151% faster): These benefit from faster rejection through the optimized prefix check
- **Heading styles show modest gains** (2-33% faster): These bypass the list prefix logic, so improvements come mainly from the cached splitting
- **Large-scale tests demonstrate consistent speedup** (20-69% faster): The optimizations scale well with volume

The optimizations are particularly effective for documents with many list-style elements or diverse style names that don't match any prefixes.

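The two optimizations described above can be sketched in isolation. The following is a minimal illustration, not the code in `unstructured/partition/docx.py`: the function names (`depth_with_any`, `depth_with_tuple`, `extract_number_uncached`, `extract_number_cached`) and the module-level prefix constants are hypothetical, and the prefix values are assumed from the style names exercised in the tests.

```python
from __future__ import annotations

# Assumed list-style prefixes, inferred from the style names used in the tests.
# The real values live in unstructured/partition/docx.py and may differ.
LIST_PREFIXES_LIST = ["List", "List Bullet", "List Continue", "List Number"]
LIST_PREFIXES_TUPLE = tuple(LIST_PREFIXES_LIST)


def extract_number_uncached(style_name: str) -> int:
    """Split twice: once to test the last word, once to convert it."""
    return int(style_name.split()[-1]) - 1 if style_name.split()[-1].isdigit() else 0


def extract_number_cached(style_name: str) -> int:
    """Split once and reuse the result for both the test and the conversion."""
    parts = style_name.split()
    return int(parts[-1]) - 1 if parts[-1].isdigit() else 0


def depth_with_any(style_name: str) -> int:
    """Hypothetical 'before' variant: any() over a list of prefixes."""
    if style_name.startswith("Heading"):
        return extract_number_uncached(style_name)
    if style_name == "Subtitle":
        return 1
    if any(style_name.startswith(prefix) for prefix in LIST_PREFIXES_LIST):
        return extract_number_uncached(style_name)
    return 0


def depth_with_tuple(style_name: str) -> int:
    """Hypothetical 'after' variant: one startswith() call with a tuple."""
    if style_name.startswith("Heading"):
        return extract_number_cached(style_name)
    if style_name == "Subtitle":
        return 1
    # str.startswith accepts a tuple of prefixes (a list would raise TypeError).
    if style_name.startswith(LIST_PREFIXES_TUPLE):
        return extract_number_cached(style_name)
    return 0


SAMPLES = [
    "Heading 1", "Heading", "Subtitle", "List", "List Bullet 3",
    "List Number 999", "Normal", "heading 1", "",
]

# Both variants are expected to agree on every sample.
assert [depth_with_any(s) for s in SAMPLES] == [depth_with_tuple(s) for s in SAMPLES]
```

To compare the two variants locally, something like `timeit.timeit(lambda: depth_with_tuple("Custom Style 7"), number=100_000)` against the `depth_with_any` equivalent should show the direction of the change, though absolute numbers will vary by machine and Python version.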
✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **87 Passed** |
| 🌀 Generated Regression Tests | ✅ **5555 Passed** |
| ⏪ Replay Tests | ✅ **13 Passed** |
| 🔎 Concolic Coverage Tests | ✅ **6 Passed** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:------------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_docx.py::test_parse_category_depth_by_style_name` | 24.5μs | 17.3μs | 41.7%✅ |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

# imports
import pytest

from unstructured.partition.docx import _DocxPartitioner


# unit tests
@pytest.fixture
def partitioner():
    # Provide a partitioner instance for use in tests
    return _DocxPartitioner()


# --------------------------
# 1. Basic Test Cases
# --------------------------

def test_heading_level_1(partitioner):
    # Heading 1 should map to depth 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1")

def test_heading_level_2(partitioner):
    # Heading 2 should map to depth 1
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2")

def test_heading_level_10(partitioner):
    # Heading 10 should map to depth 9
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 10")

def test_subtitle(partitioner):
    # Subtitle should map to depth 1
    codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle")

def test_list_bullet_1(partitioner):
    # List Bullet 1 should map to depth 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 1")

def test_list_bullet_3(partitioner):
    # List Bullet 3 should map to depth 2
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 3")

def test_list_number_2(partitioner):
    # List Number 2 should map to depth 1
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 2")

def test_list_continue_5(partitioner):
    # List Continue 5 should map to depth 4
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 5")

def test_list_plain(partitioner):
    # "List" without a number should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List")

def test_normal_style(partitioner):
    # Any non-special style should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("Normal")

def test_random_style(partitioner):
    # Unknown style name should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("RandomStyle")


# --------------------------
# 2. Edge Test Cases
# --------------------------

def test_heading_with_extra_spaces(partitioner):
    # Heading with extra spaces should still parse the last word as number if possible
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 3")

def test_heading_without_number(partitioner):
    # Heading with no number should map to 0 (since no number to subtract 1)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading")

def test_list_bullet_with_non_digit_suffix(partitioner):
    # List Bullet with non-digit at end should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet foo")

def test_list_number_with_large_number(partitioner):
    # List Number with a large number
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 999")

def test_empty_string(partitioner):
    # Empty string should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("")

def test_case_sensitivity(partitioner):
    # Should be case-sensitive: "heading 1" does not match "Heading"
    codeflash_output = partitioner._parse_category_depth_by_style_name("heading 1")

def test_subtitle_case(partitioner):
    # "subtitle" (lowercase) should not match "Subtitle"
    codeflash_output = partitioner._parse_category_depth_by_style_name("subtitle")

def test_list_bullet_with_multiple_spaces(partitioner):
    # List Bullet with multiple spaces before number
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 2")

def test_style_name_with_trailing_space(partitioner):
    # Style name with trailing space
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 4 ")

def test_style_name_with_leading_space(partitioner):
    # Style name with leading space
    codeflash_output = partitioner._parse_category_depth_by_style_name(" List Bullet 2")

def test_style_name_with_internal_non_digit(partitioner):
    # Heading with non-digit in the number position
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading X")

def test_style_name_with_number_in_middle(partitioner):
    # Only the last word is checked for a digit
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2 Extra")

def test_list_continue_with_no_number(partitioner):
    # List Continue with no number should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue")

def test_style_name_with_special_characters(partitioner):
    # Style name with special characters should not break function
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading #$%")

def test_list_prefix_overlap(partitioner):
    # "List BulletPoint 2" does not match any valid prefix, so should map to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("List BulletPoint 2")


# --------------------------
# 3. Large Scale Test Cases
# --------------------------

def test_many_headings(partitioner):
    # Test a large number of headings, up to 1000
    for i in range(1, 1001):
        # "Heading N" should map to N-1
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}")

def test_many_list_bullets(partitioner):
    # Test a large number of list bullets, up to 1000
    for i in range(1, 1001):
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}")

def test_many_list_numbers(partitioner):
    # Test a large number of list numbers, up to 1000
    for i in range(1, 1001):
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}")

def test_mixed_styles_large_scale(partitioner):
    # Mix a large number of different style names, including edge cases
    for i in range(1, 501):
        # Headings
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}")
        # List Bullets
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}")
        # List Number
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}")
        # List Continue
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Continue {i}")
        # Unknown style
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Unknown {i}")
        # Heading with non-digit
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}x")

def test_large_scale_with_unusual_inputs(partitioner):
    # Test 1000 random/edge case style names
    for i in range(1, 1001):
        # Style with only number
        codeflash_output = partitioner._parse_category_depth_by_style_name(str(i))
        # Style with number at start
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"{i} Heading")
        # Style with number in middle
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List {i} Bullet")
        # Style with extra spaces
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}")

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.docx import _DocxPartitioner

# function to test
# pyright: reportPrivateUsage=false


class DocxPartitionerOptions:
    pass


from unstructured.partition.docx import _DocxPartitioner


# unit tests
@pytest.fixture
def partitioner():
    # Fixture to create a _DocxPartitioner instance
    return _DocxPartitioner(DocxPartitionerOptions())


# 1. Basic Test Cases

def test_heading_styles_basic(partitioner):
    # Test standard heading styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1")  # 4.50μs -> 4.41μs (2.16% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2")  # 1.45μs -> 1.41μs (2.62% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 3")  # 1.23μs -> 1.07μs (14.6% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 10")  # 1.45μs -> 1.09μs (33.2% faster)

def test_subtitle_style(partitioner):
    # Test the special case for 'Subtitle'
    codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle")  # 1.97μs -> 1.89μs (4.18% faster)

def test_list_styles_basic(partitioner):
    # Test basic list styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 1")  # 6.28μs -> 4.37μs (43.6% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 2")  # 2.53μs -> 1.59μs (59.8% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 10")  # 2.36μs -> 1.47μs (60.8% faster)

def test_list_bullet_styles(partitioner):
    # Test 'List Bullet' styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 1")  # 6.13μs -> 4.47μs (37.1% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 2")  # 2.53μs -> 1.71μs (47.9% faster)

def test_list_continue_styles(partitioner):
    # Test 'List Continue' styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 1")  # 6.34μs -> 4.41μs (43.9% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 5")  # 2.59μs -> 1.75μs (48.1% faster)

def test_list_number_styles(partitioner):
    # Test 'List Number' styles
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 1")  # 6.25μs -> 4.34μs (44.0% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 3")  # 2.54μs -> 1.72μs (48.2% faster)

def test_other_styles_default_to_zero(partitioner):
    # Test styles that should default to 0
    codeflash_output = partitioner._parse_category_depth_by_style_name("Normal")  # 4.09μs -> 2.48μs (65.0% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Body Text")  # 1.94μs -> 913ns (113% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Title")  # 1.65μs -> 728ns (127% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Random Style")  # 1.58μs -> 727ns (117% faster)


# 2. Edge Test Cases

def test_heading_without_number(partitioner):
    # Test 'Heading' with no number
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading")  # 3.04μs -> 3.10μs (1.97% slower)

def test_list_without_number(partitioner):
    # Test 'List' with no number
    codeflash_output = partitioner._parse_category_depth_by_style_name("List")  # 5.24μs -> 3.63μs (44.4% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet")  # 2.56μs -> 1.72μs (49.1% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue")  # 1.66μs -> 1.08μs (53.4% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number")  # 1.52μs -> 932ns (63.2% faster)

def test_heading_with_non_numeric_suffix(partitioner):
    # Test 'Heading' with a non-numeric suffix
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading One")  # 3.37μs -> 3.44μs (1.95% slower)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading X")  # 1.36μs -> 1.32μs (2.81% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1A")  # 953ns -> 983ns (3.05% slower)

def test_list_with_non_numeric_suffix(partitioner):
    # Test 'List' with a non-numeric suffix
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet X")  # 5.65μs -> 3.98μs (42.2% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue A")  # 2.24μs -> 1.61μs (38.9% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Number Foo")  # 1.76μs -> 1.22μs (44.1% faster)

def test_case_sensitivity(partitioner):
    # Test that style names are case-sensitive
    codeflash_output = partitioner._parse_category_depth_by_style_name("heading 1")  # 3.98μs -> 2.35μs (69.0% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("HEADING 1")  # 2.01μs -> 935ns (115% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle")  # 665ns -> 591ns (12.5% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("subtitle")  # 1.71μs -> 803ns (113% faster)

def test_empty_and_whitespace_styles(partitioner):
    # Test empty string and whitespace-only style names
    codeflash_output = partitioner._parse_category_depth_by_style_name("")  # 4.14μs -> 2.40μs (72.4% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name(" ")  # 1.94μs -> 808ns (139% faster)

def test_style_name_with_extra_spaces(partitioner):
    # Test style names with extra spaces
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2")  # 3.79μs -> 3.78μs (0.371% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 3")  # 4.11μs -> 2.14μs (92.0% faster)

def test_style_name_with_leading_trailing_spaces(partitioner):
    # Test style names with leading/trailing spaces
    codeflash_output = partitioner._parse_category_depth_by_style_name(" Heading 1")  # 3.97μs -> 2.44μs (62.9% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 2 ")  # 4.34μs -> 3.08μs (41.2% faster)

def test_style_name_with_multiple_words(partitioner):
    # Test style names with multiple words that don't match any prefix
    codeflash_output = partitioner._parse_category_depth_by_style_name("My Custom Heading 1")  # 3.81μs -> 2.31μs (65.1% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet Special 2")  # 4.46μs -> 3.08μs (45.0% faster)

def test_style_name_with_large_number(partitioner):
    # Test styles with very large numbers
    codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 999")  # 4.21μs -> 4.06μs (3.64% faster)
    codeflash_output = partitioner._parse_category_depth_by_style_name("List 1000")  # 4.09μs -> 2.19μs (86.9% faster)


# 3. Large Scale Test Cases

def test_large_number_of_headings(partitioner):
    # Test a large number of heading levels for performance and correctness
    for i in range(1, 1000):
        style = f"Heading {i}"
        expected = i - 1
        codeflash_output = partitioner._parse_category_depth_by_style_name(style)  # 1.06ms -> 881μs (20.5% faster)

def test_large_number_of_list_bullets(partitioner):
    # Test a large number of list bullet levels
    for i in range(1, 1000):
        style = f"List Bullet {i}"
        expected = i - 1
        codeflash_output = partitioner._parse_category_depth_by_style_name(style)  # 1.77ms -> 1.05ms (67.8% faster)

def test_large_number_of_list_numbers(partitioner):
    # Test a large number of list number levels
    for i in range(1, 1000):
        style = f"List Number {i}"
        expected = i - 1
        codeflash_output = partitioner._parse_category_depth_by_style_name(style)  # 1.77ms -> 1.05ms (69.0% faster)

def test_large_number_of_non_matching_styles(partitioner):
    # Test a large number of non-matching style names
    for i in range(1, 1000):
        style = f"Custom Style {i}"
        codeflash_output = partitioner._parse_category_depth_by_style_name(style)  # 1.47ms -> 583μs (151% faster)

def test_large_mixed_styles(partitioner):
    # Test a mixture of all types in a large batch
    for i in range(1, 250):
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}")  # 282μs -> 229μs (23.2% faster)
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List {i}")
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}")  # 434μs -> 265μs (63.7% faster)
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Continue {i}")
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}")  # 431μs -> 258μs (66.5% faster)
        codeflash_output = partitioner._parse_category_depth_by_style_name(f"Random Style {i}")

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import TextIO
from unstructured.partition.docx import DocxPartitionerOptions
from unstructured.partition.docx import _DocxPartitioner

def test__DocxPartitioner__parse_category_depth_by_style_name():
    _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=None, file_path='', include_page_breaks=False, infer_table_structure=True, starting_page_number=0, strategy=None)), 'List\x00\x00\x00\x00')

def test__DocxPartitioner__parse_category_depth_by_style_name_2():
    _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=None, file_path=None, include_page_breaks=False, infer_table_structure=False, starting_page_number=0, strategy=None)), '')

def test__DocxPartitioner__parse_category_depth_by_style_name_3():
    _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=TextIO(), file_path='', include_page_breaks=True, infer_table_structure=False, starting_page_number=0, strategy='')), 'Subtitle')
```

</details>

<details>
<summary>⏪ Replay Tests and Runtime</summary>

</details>

<details>
<summary>🔎 Concolic Coverage Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name` | 6.64μs | 4.95μs | 34.2%✅ |
| `codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name_2` | 4.50μs | 2.90μs | 54.8%✅ |
| `codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name_3` | 1.80μs | 1.70μs | 5.58%✅ |

</details>

To edit these changes `git checkout codeflash/optimize-_DocxPartitioner._parse_category_depth_by_style_name-menbhfu6` and push.

[Codeflash](https://codeflash.ai)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>

References

#4174 - ⚡️ Speed up method `_DocxPartitioner._parse_category_depth_by_style_name` by 69%

Author

aseembits93

Parents

7fd3fccd
