unstructured
a55810de - enhancement: Speed up method `_DocxPartitioner._style_based_element_type` by 593% (#4179)

Commit

29 days ago

enhancement: Speed up method `_DocxPartitioner._style_based_element_type` by 593% (#4179)

#### 📄 593% (5.93x) speedup for ***`_DocxPartitioner._style_based_element_type` in `unstructured/partition/docx.py`***

⏱️ Runtime : **`5.53 milliseconds`** **→** **`798 microseconds`** (best of `116` runs)

#### 📝 Explanation and details

The optimization achieves a **593% speedup** by moving the `STYLE_TO_ELEMENT_MAPPING` dictionary from inside the method to module level as a global constant.

**What changed:**
- Moved the 29-entry dictionary definition from inside `_style_based_element_type()` to the module level as `STYLE_TO_ELEMENT_MAPPING`
- The method now simply references the pre-built dictionary instead of reconstructing it on every call

**Why this is dramatically faster:**
The original code was reconstructing a 29-entry dictionary on every single method invocation. The line profiler shows this dictionary creation consumed **58.7% of total execution time** (33.7ms out of 57.6ms total). Each dictionary entry required individual object creation and insertion operations, creating significant overhead when called repeatedly.

By moving the dictionary to module level, it's constructed only once when the module is imported, eliminating this repeated work entirely. The optimized version shows the dictionary lookup now takes only 53.3% of the much smaller total time.
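
The change described above can be sketched in isolation. This is a minimal illustration, not the code from `unstructured/partition/docx.py`: the element classes are empty stand-ins, the mapping is trimmed to four entries, the function names `element_type_rebuilt`, `element_type_hoisted`, and `compare` are hypothetical, and the use of `dict.get` for the lookup is an assumption.

```python
import timeit


# Hypothetical stand-ins for the real element classes, which are not imported here.
class Text:
    pass


class Title:
    pass


class ListItem:
    pass


def element_type_rebuilt(style_name):
    """Pre-optimization shape: the dict literal is evaluated on every call."""
    mapping = {
        "Caption": Text,
        "Heading 1": Title,
        "List Bullet": ListItem,
        "Title": Title,
    }
    # Assumption: unknown style names fall through to None via dict.get.
    return mapping.get(style_name)


# Post-optimization shape: the dict is built once, when the module is imported.
STYLE_TO_ELEMENT_MAPPING = {
    "Caption": Text,
    "Heading 1": Title,
    "List Bullet": ListItem,
    "Title": Title,
}


def element_type_hoisted(style_name):
    """Only performs the lookup; no per-call dictionary construction."""
    return STYLE_TO_ELEMENT_MAPPING.get(style_name)


def compare(number=100_000):
    """Return (rebuilt_seconds, hoisted_seconds) for `number` calls each."""
    rebuilt = timeit.timeit(lambda: element_type_rebuilt("Title"), number=number)
    hoisted = timeit.timeit(lambda: element_type_hoisted("Title"), number=number)
    return rebuilt, hoisted


if __name__ == "__main__":
    rebuilt, hoisted = compare()
    print(f"rebuilt: {rebuilt:.4f}s  hoisted: {hoisted:.4f}s")
```

Both functions return the same result for any style name; only the per-call cost differs, and the gap should widen as the mapping grows toward its full size. The headline figure follows from the reported runtimes as (5.53 ms - 0.798 ms) / 0.798 ms ≈ 5.93, meaning the optimized method takes roughly one seventh of the original time.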

**Performance characteristics:**
- **All test cases** show 300-600% speedups, indicating consistent benefits across different style types
- **Large-scale tests** with 800-1000 paragraphs show particularly strong gains (518-645% speedups), demonstrating the optimization scales well with volume
- **Edge cases** (None styles, unknown styles) benefit equally, showing the optimization doesn't create performance regressions

This optimization is especially valuable for document processing workloads where `_style_based_element_type()` is called repeatedly for each paragraph in potentially large documents, making the cumulative time savings substantial.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **5947 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from types import SimpleNamespace

# function to test
# (copied from above, with necessary imports and dummy DocxPartitionerOptions)
# imports
from unstructured.partition.docx import _DocxPartitioner


# Dummy DocxPartitionerOptions for __init__ signature
class DocxPartitionerOptions:
    pass


# Dummy element types for testing
class Text:
    pass


class Title:
    pass


class ListItem:
    pass


# Helper to create a mock Paragraph object with a given style name
def make_paragraph(style_name=None):
    if style_name is None:
        # Simulate paragraph.style is None
        return SimpleNamespace(style=None)
    else:
        # Simulate paragraph.style.name
        style = SimpleNamespace(name=style_name)
        return SimpleNamespace(style=style)


# Basic Test Cases


def test_heading_styles():
    """Test that heading styles map to Title."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    for i in range(1, 10):
        para = make_paragraph(f"Heading {i}")
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 9.46μs -> 1.79μs (428% faster)


def test_title_style():
    """Test that 'Title' style maps to Title."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("Title")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_subtitle_style():
    """Test that 'Subtitle' style maps to Title."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("Subtitle")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.25μs -> 292ns (328% faster)


def test_tocheading_style():
    """Test that 'TOCHeading' style maps to Title."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("TOCHeading")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_list_styles():
    """Test that various list styles map to ListItem."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    list_styles = [
        "List",
        "List 2",
        "List 3",
        "List Bullet",
        "List Bullet 2",
        "List Bullet 3",
        "List Continue",
        "List Continue 2",
        "List Continue 3",
        "List Number",
        "List Number 2",
        "List Number 3",
        "List Paragraph",
    ]
    for style in list_styles:
        para = make_paragraph(style)
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 12.7μs -> 2.04μs (521% faster)


def test_text_styles():
    """Test that various text styles map to Text."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    text_styles = ["Caption", "Intense Quote", "Macro Text", "No Spacing", "Quote"]
    for style in text_styles:
        para = make_paragraph(style)
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 5.12μs -> 873ns (487% faster)


def test_unknown_style_returns_none():
    """Test that unknown style names return None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("MyCustomStyle")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.33μs -> 291ns (358% faster)


def test_normal_style_returns_none():
    """Test that 'Normal' style returns None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("Normal")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.33μs -> 291ns (358% faster)


# Edge Test Cases


def test_paragraph_style_is_none():
    """Test that paragraph.style is None returns None (treated as 'Normal')."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph(None)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (416% faster)


def test_paragraph_style_name_is_none():
    """Test that paragraph.style.name is None returns None (treated as 'Normal')."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    style = SimpleNamespace(name=None)
    para = SimpleNamespace(style=style)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_style_name_case_sensitivity():
    """Test that style names are case sensitive."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("title")  # lower-case, should not match
    codeflash_output = partitioner._style_based_element_type(para)  # 2.38μs -> 667ns (256% faster)


def test_style_name_with_whitespace():
    """Test that style names with extra whitespace do not match."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph(" Title ")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.67μs -> 333ns (401% faster)


def test_style_name_is_empty_string():
    """Test that empty string style name returns None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.62μs -> 291ns (458% faster)


def test_style_name_is_integer():
    """Test that non-string style name returns None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    style = SimpleNamespace(name=123)
    para = SimpleNamespace(style=style)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.50μs -> 333ns (350% faster)


def test_style_name_is_none_and_style_is_object():
    """Test that style.name is None and style is not None returns None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    style = SimpleNamespace(name=None)
    para = SimpleNamespace(style=style)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.38μs -> 292ns (371% faster)


# Large Scale Test Cases


def test_large_number_of_paragraphs_known_styles():
    """Test performance and correctness with many paragraphs of known styles."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    styles = [
        "Heading 1",
        "Heading 2",
        "Heading 3",
        "List",
        "List Bullet",
        "Title",
        "Caption",
        "Quote",
    ]
    expected_types = {
        "Heading 1": Title,
        "Heading 2": Title,
        "Heading 3": Title,
        "List": ListItem,
        "List Bullet": ListItem,
        "Title": Title,
        "Caption": Text,
        "Quote": Text,
    }
    paragraphs = [make_paragraph(style) for style in styles * 100]  # 800 paragraphs
    for para in paragraphs:
        style_name = para.style.name
        expected = expected_types[style_name]
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 737μs -> 100μs (637% faster)


def test_large_number_of_paragraphs_unknown_styles():
    """Test performance and correctness with many paragraphs of unknown styles."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    paragraphs = [make_paragraph(f"CustomStyle{i}") for i in range(1000)]
    for para in paragraphs:
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 955μs -> 154μs (518% faster)


def test_large_number_of_paragraphs_mixed_styles():
    """Test performance and correctness with a mix of known and unknown styles."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    known_styles = ["Heading 1", "List", "Title", "Caption"]
    unknown_styles = [f"Unknown{i}" for i in range(500)]
    paragraphs = []
    # Alternate known and unknown styles
    for i in range(500):
        paragraphs.append(make_paragraph(known_styles[i % len(known_styles)]))
        paragraphs.append(make_paragraph(unknown_styles[i]))
    # 1000 paragraphs total
    for i, para in enumerate(paragraphs):
        if i % 2 == 0:
            # known style
            style = known_styles[(i // 2) % len(known_styles)]
            expected = {"Heading 1": Title, "List": ListItem, "Title": Title, "Caption": Text}[
                style
            ]
            codeflash_output = partitioner._style_based_element_type(para)
        else:
            # unknown style
            codeflash_output = partitioner._style_based_element_type(para)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
# imports
import pytest

from unstructured.partition.docx import _DocxPartitioner


# Dummy classes for element types (simulate unstructured.documents.elements)
class Text:
    pass


class Title:
    pass


class ListItem:
    pass


# Dummy Paragraph and Style classes to simulate python-docx
class DummyStyle:
    def __init__(self, name):
        self.name = name


class DummyParagraph:
    def __init__(self, style):
        self.style = style


# Dummy DocxPartitionerOptions (not used in the function, but required for __init__)
class DocxPartitionerOptions:
    pass


# unit tests


@pytest.fixture
def partitioner():
    # Returns an instance of _DocxPartitioner for use in tests
    return _DocxPartitioner(DocxPartitionerOptions())


# ---------------------------
# 1. Basic Test Cases
# ---------------------------


def test_heading_styles_return_title(partitioner):
    # Test all heading styles map to Title
    for i in range(1, 10):
        para = DummyParagraph(DummyStyle(f"Heading {i}"))
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 9.25μs -> 1.50μs (517% faster)


def test_caption_returns_text(partitioner):
    para = DummyParagraph(DummyStyle("Caption"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (417% faster)


def test_quote_returns_text(partitioner):
    para = DummyParagraph(DummyStyle("Quote"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (416% faster)


def test_subtitle_returns_title(partitioner):
    para = DummyParagraph(DummyStyle("Subtitle"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_title_returns_title(partitioner):
    para = DummyParagraph(DummyStyle("Title"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_tocheading_returns_title(partitioner):
    para = DummyParagraph(DummyStyle("TOCHeading"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (416% faster)


def test_list_styles_return_listitem(partitioner):
    list_styles = [
        "List",
        "List 2",
        "List 3",
        "List Bullet",
        "List Bullet 2",
        "List Bullet 3",
        "List Continue",
        "List Continue 2",
        "List Continue 3",
        "List Number",
        "List Number 2",
        "List Number 3",
        "List Paragraph",
    ]
    for style in list_styles:
        para = DummyParagraph(DummyStyle(style))
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 12.5μs -> 2.04μs (513% faster)


def test_macro_text_returns_text(partitioner):
    para = DummyParagraph(DummyStyle("Macro Text"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.38μs -> 292ns (371% faster)


def test_no_spacing_returns_text(partitioner):
    para = DummyParagraph(DummyStyle("No Spacing"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.33μs -> 292ns (357% faster)


# ---------------------------
# 2. Edge Test Cases
# ---------------------------


def test_normal_style_returns_none(partitioner):
    # "Normal" style is not mapped, should return None
    para = DummyParagraph(DummyStyle("Normal"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (417% faster)


def test_none_style_returns_none(partitioner):
    # Paragraph.style is None, should be treated as "Normal"
    para = DummyParagraph(None)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.25μs -> 250ns (400% faster)


def test_style_object_with_none_name_returns_none(partitioner):
    # Paragraph.style.name is None, should be treated as "Normal"
    style = DummyStyle(None)
    para = DummyParagraph(style)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (416% faster)


def test_unknown_style_returns_none(partitioner):
    # Unknown style should return None
    para = DummyParagraph(DummyStyle("MyCustomStyle"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (417% faster)


def test_style_name_empty_string_returns_none(partitioner):
    # Empty string style name should return None
    para = DummyParagraph(DummyStyle(""))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (417% faster)


def test_style_name_whitespace_returns_none(partitioner):
    # Whitespace style name should return None
    para = DummyParagraph(DummyStyle(" "))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_style_name_case_sensitivity(partitioner):
    # Should be case-sensitive: "heading 1" should not match "Heading 1"
    para = DummyParagraph(DummyStyle("heading 1"))
    codeflash_output = partitioner._style_based_element_type(para)  # 2.00μs -> 500ns (300% faster)


def test_style_name_with_extra_spaces(partitioner):
    # Should not match if extra spaces are present
    para = DummyParagraph(DummyStyle(" Heading 1 "))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.54μs -> 375ns (311% faster)


# ---------------------------
# 3. Large Scale Test Cases
# ---------------------------


def test_large_batch_of_paragraphs_mixed_styles(partitioner):
    # Create a large number of paragraphs with a mix of known and unknown styles
    known_styles = [
        "Heading 1",
        "Caption",
        "List Bullet",
        "Quote",
        "Subtitle",
        "Title",
        "TOCHeading",
    ]
    unknown_styles = ["CustomStyleA", "Unknown", "heading 1", "", " ", None]
    paragraphs = []
    # Alternate between known and unknown styles
    for i in range(500):
        style_name = known_styles[i % len(known_styles)]
        paragraphs.append(DummyParagraph(DummyStyle(style_name)))
        style_name = unknown_styles[i % len(unknown_styles)]
        paragraphs.append(DummyParagraph(DummyStyle(style_name)))
    # Check that known styles return correct types, unknown return None
    for i, para in enumerate(paragraphs):
        style_name = para.style.name if para.style else None
        if style_name in known_styles:
            expected_type = {
                "Heading 1": Title,
                "Caption": Text,
                "List Bullet": ListItem,
                "Quote": Text,
                "Subtitle": Title,
                "Title": Title,
                "TOCHeading": Title,
            }[style_name]
            codeflash_output = partitioner._style_based_element_type(para)
        else:
            codeflash_output = partitioner._style_based_element_type(para)


def test_all_mapping_styles_are_covered(partitioner):
    # Ensure every style in the mapping returns the correct type
    mapping = {
        "Caption": Text,
        "Heading 1": Title,
        "Heading 2": Title,
        "Heading 3": Title,
        "Heading 4": Title,
        "Heading 5": Title,
        "Heading 6": Title,
        "Heading 7": Title,
        "Heading 8": Title,
        "Heading 9": Title,
        "Intense Quote": Text,
        "List": ListItem,
        "List 2": ListItem,
        "List 3": ListItem,
        "List Bullet": ListItem,
        "List Bullet 2": ListItem,
        "List Bullet 3": ListItem,
        "List Continue": ListItem,
        "List Continue 2": ListItem,
        "List Continue 3": ListItem,
        "List Number": ListItem,
        "List Number 2": ListItem,
        "List Number 3": ListItem,
        "List Paragraph": ListItem,
        "Macro Text": Text,
        "No Spacing": Text,
        "Quote": Text,
        "Subtitle": Title,
        "TOCHeading": Title,
        "Title": Title,
    }
    for style_name, expected_type in mapping.items():
        para = DummyParagraph(DummyStyle(style_name))
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 28.2μs -> 4.28μs (559% faster)


def test_large_number_of_unknown_styles_returns_none(partitioner):
    # Test with 1000 paragraphs with unknown styles
    for i in range(1000):
        para = DummyParagraph(DummyStyle(f"UnknownStyle{i}"))
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 939μs -> 140μs (568% faster)


def test_large_number_of_normal_and_none_styles_returns_none(partitioner):
    # Test with 500 paragraphs with "Normal" and 500 with None style
    for i in range(500):
        para_normal = DummyParagraph(DummyStyle("Normal"))
        para_none = DummyParagraph(None)
        codeflash_output = partitioner._style_based_element_type(
            para_normal
        )  # 459μs -> 61.7μs (645% faster)
        codeflash_output = partitioner._style_based_element_type(para_none)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>

To edit these changes `git checkout codeflash/optimize-_DocxPartitioner._style_based_element_type-mjdwusew` and push.

[Codeflash](https://codeflash.ai) ![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>

References

#4179 - ⚡️ Speed up method `_DocxPartitioner._style_based_element_type` by 593%

Author

aseembits93

Parents

6abc5dfa
