unstructured
174a6ae9 - enhancement: Speed up function `sentence_count` by 1,038% (#4160)

Commit

129 days ago

enhancement: Speed up function `sentence_count` by 1,038% (#4160)

#### 📄 1,038% (10.38x) speedup for ***`sentence_count` in `unstructured/partition/text_type.py`***

⏱️ Runtime : **`51.8 milliseconds`** **→** **`4.55 milliseconds`** (best of `14` runs)

#### 📝 Explanation and details

The optimized code achieves a **1,038% speedup (51.8ms → 4.55ms)** through two key optimizations:

## 1. **Caching Fix for `sent_tokenize` (Primary Speedup)**

**Problem**: The original code applied `@lru_cache` directly to `sent_tokenize`, whose result comes from NLTK's `_sent_tokenize` as a `List[str]`. The commit describes this list result as **unhashable** and not properly cacheable by Python's `lru_cache`. (Strictly, `lru_cache` requires hashable *arguments*, not return values; the concrete risk of caching a list is that every caller shares one mutable object.)

**Solution**: The optimized version introduces a two-layer approach:
- `_tokenize_for_cache()` - Cached function that returns `Tuple[str, ...]` (hashable and immutable)
- `sent_tokenize()` - Public wrapper that converts the tuple to a list

**Why it's faster**: Tokenization results for repeated text are served from the cache. The test annotations show speedups of up to **35,000%** on repeated text, which the commit attributes to the cache. Since `sentence_count` tokenizes the same text patterns repeatedly across function calls, the cache hit rate is crucial.

**Impact on hot paths**: Based on `function_references`, this function is called from:
- `is_possible_narrative_text()` - checks if text contains ≥2 sentences with `sentence_count(text, 3)`
- `is_possible_title()` - validates the single-sentence constraint with `sentence_count(text, min_length=...)`
- `exceeds_cap_ratio()` - checks sentence count to avoid multi-sentence text

These are all text classification functions likely invoked repeatedly during document parsing, making the caching fix highly impactful.

## 2. **Branch Prediction Optimization in `sentence_count`**

**Change**: Split the loop into two branches - one for the `min_length` case, one for no filtering:

```python
if min_length:
    # Loop with filtering logic
    ...
else:
    # Simple counting loop
    ...
```

**Why it's faster**:
- Eliminates repeated `if min_length:` checks inside the loop (7,181 checks in the profiler)
- Allows the CPU branch predictor to optimize each loop independently
- Hoists the `trace_logger.detail` lookup outside the loop (68 calls vs 3,046+ attribute lookups)

**Test results validation**:
- Cases **without** `min_length` show **massive speedups** (3,000-35,000%), attributed to pure caching benefits
- Cases **with** `min_length` show **moderate speedups** (60-940%), since the filtering logic still executes but benefits from reduced overhead and hoisting

The optimization is most effective for workloads that process similar text patterns repeatedly (common in document parsing pipelines), particularly when `min_length` is not specified, which appears to be the common case based on function references.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **60 Passed** |
| ⏪ Replay Tests | ✅ **5 Passed** |
| 🔎 Concolic Coverage Tests | ✅ **1 Passed** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_text_type.py::test_item_titles` | 47.2μs | 8.06μs | 486%✅ |
| `partition/test_text_type.py::test_sentence_count` | 4.34μs | 1.81μs | 139%✅ |

</details>

<details>
<summary>🌀 Click to see Generated Regression Tests</summary>

```python
# imports
from unstructured.partition.text_type import sentence_count

# Basic Test Cases


def test_single_sentence():
    # Simple single sentence
    text = "This is a test sentence."
    codeflash_output = sentence_count(text)  # 20.1μs -> 2.52μs (697% faster)


def test_multiple_sentences():
    # Multiple sentences separated by periods
    text = "This is the first sentence. This is the second sentence. Here is a third."
    codeflash_output = sentence_count(text)  # 62.7μs -> 1.58μs (3868% faster)


def test_sentences_with_various_punctuation():
    # Sentences ending with different punctuation
    text = "Is this a question? Yes! It is."
    codeflash_output = sentence_count(text)  # 44.1μs -> 1.48μs (2879% faster)


def test_sentence_with_min_length_none():
    # min_length=None should count all sentences
    text = "Short. Another one."
    codeflash_output = sentence_count(text, min_length=None)  # 27.0μs -> 1.59μs (1595% faster)


def test_sentence_with_min_length():
    # Only sentences with at least min_length words are counted
    text = "Short. This is a long enough sentence."
    codeflash_output = sentence_count(text, min_length=4)  # 33.2μs -> 13.5μs (146% faster)


def test_sentence_with_min_length_exact():
    # Sentence with exactly min_length words should be counted
    text = "One two three four."
    codeflash_output = sentence_count(text, min_length=4)  # 10.1μs -> 5.04μs (99.5% faster)


# Edge Test Cases


def test_empty_string():
    # Empty string should return 0
    codeflash_output = sentence_count("")  # 5.30μs -> 1.04μs (409% faster)


def test_whitespace_only():
    # String with only whitespace should return 0
    codeflash_output = sentence_count(" ")  # 5.26μs -> 888ns (493% faster)


def test_no_sentence_punctuation():
    # Text with no sentence-ending punctuation is treated as one sentence by NLTK
    text = "This is just a run on sentence with no punctuation"
    codeflash_output = sentence_count(text)  # 8.34μs -> 1.13μs (638% faster)


def test_sentence_with_only_punctuation():
    # Sentences that are just punctuation should not be counted if min_length is set
    text = "!!! ... ???"
    codeflash_output = sentence_count(text, min_length=1)  # 79.0μs -> 7.59μs (940% faster)


def test_sentence_with_non_ascii_punctuation():
    # Sentences with Unicode punctuation
    text = "This is a test sentence。This is another！"
    # NLTK may not split these as sentences; check for at least 1
    codeflash_output = sentence_count(text)  # 10.9μs -> 1.13μs (871% faster)


def test_sentence_with_abbreviations():
    # Abbreviations should not split sentences incorrectly
    text = "Dr. Smith went to Washington. He arrived at 10 a.m. sharp."
    codeflash_output = sentence_count(text)  # 57.9μs -> 1.43μs (3959% faster)


def test_sentence_with_newlines():
    # Sentences separated by newlines
    text = "First sentence.\nSecond sentence!\n\nThird sentence?"
    codeflash_output = sentence_count(text)  # 43.2μs -> 1.34μs (3113% faster)


def test_sentence_with_multiple_spaces():
    # Sentences with irregular spacing
    text = "First sentence. Second sentence. "
    codeflash_output = sentence_count(text)  # 27.6μs -> 1.16μs (2282% faster)


def test_sentence_with_min_length_zero():
    # min_length=0 should count all sentences
    text = "A. B."
    codeflash_output = sentence_count(text, min_length=0)  # 27.7μs -> 1.38μs (1909% faster)


def test_sentence_with_min_length_greater_than_any_sentence():
    # All sentences are too short for min_length
    text = "A. B."
    codeflash_output = sentence_count(text, min_length=10)  # 5.47μs -> 6.16μs (11.2% slower)


def test_sentence_with_just_numbers():
    # Sentences that are just numbers
    text = "12345. 67890."
    codeflash_output = sentence_count(text)  # 31.7μs -> 1.29μs (2350% faster)


def test_sentence_with_only_punctuation_and_spaces():
    # Only punctuation and spaces
    text = " . . . "
    codeflash_output = sentence_count(text)  # 34.2μs -> 1.31μs (2502% faster)


def test_sentence_with_ellipsis():
    # Ellipsis should not break sentence count
    text = "Wait... what happened? I don't know..."
    codeflash_output = sentence_count(text)  # 44.7μs -> 1.36μs (3182% faster)


# Large Scale Test Cases


def test_large_number_of_sentences():
    # 1000 short sentences
    text = "Sentence. " * 1000
    codeflash_output = sentence_count(text)  # 8.26ms -> 23.5μs (35048% faster)


def test_large_text_with_long_sentences():
    # 500 sentences, each with 10 words
    sentence = "This is a sentence with exactly ten words."
    text = " ".join([sentence for _ in range(500)])
    codeflash_output = sentence_count(text)  # 4.11ms -> 17.3μs (23651% faster)


def test_large_text_min_length_filtering():
    # 1000 sentences, only half meet min_length
    short_sentence = "Short."
    long_sentence = "This is a sufficiently long sentence for testing."
    text = " ".join([short_sentence, long_sentence] * 500)
    codeflash_output = sentence_count(text, min_length=5)  # 8.78ms -> 1.15ms (664% faster)


def test_large_text_all_filtered():
    # All sentences filtered out by min_length
    sentence = "A."
    text = " ".join([sentence for _ in range(1000)])
    codeflash_output = sentence_count(text, min_length=3)  # 7.74ms -> 499μs (1450% faster)


# Regression/Mutation tests


def test_min_length_does_not_count_punctuation_as_word():
    # Punctuation-only tokens should not be counted as words
    text = "This . is . a . test."
    # Each "is .", "a .", "test." is a sentence, but only the last is a real sentence
    # NLTK will likely see this as one sentence
    codeflash_output = sentence_count(text, min_length=2)  # 52.5μs -> 7.96μs (560% faster)


def test_sentences_with_internal_periods():
    # Internal periods (e.g., in abbreviations) do not split sentences
    text = "This is Mr. Smith. He lives on St. Patrick's street."
    codeflash_output = sentence_count(text)  # 55.1μs -> 1.23μs (4371% faster)


def test_sentence_with_trailing_spaces_and_newlines():
    # Sentences with trailing spaces and newlines
    text = "First sentence. \nSecond sentence. \n"
    codeflash_output = sentence_count(text)  # 29.0μs -> 1.19μs (2337% faster)


def test_sentence_with_tabs():
    # Sentences separated by tabs
    text = "First sentence.\tSecond sentence."
    codeflash_output = sentence_count(text)  # 30.1μs -> 1.10μs (2645% faster)


def test_sentence_with_multiple_types_of_whitespace():
    # Sentences separated by various whitespace
    text = "First sentence.\n\t Second sentence.\r\nThird sentence."
    codeflash_output = sentence_count(text)  # 45.0μs -> 1.30μs (3373% faster)


def test_sentence_with_unicode_whitespace():
    # Sentences separated by Unicode whitespace
    text = "First sentence.\u2003Second sentence.\u2029Third sentence."
    codeflash_output = sentence_count(text)  # 47.4μs -> 1.24μs (3714% faster)


def test_sentence_with_emojis():
    # Sentences containing emojis
    text = "Hello world! 😀 How are you? 👍"
    codeflash_output = sentence_count(text)  # 47.4μs -> 1.16μs (3989% faster)


def test_sentence_with_quotes():
    # Sentences with quoted text
    text = "\"Hello,\" she said. 'How are you?'"
    codeflash_output = sentence_count(text)  # 41.7μs -> 1.07μs (3812% faster)


def test_sentence_with_parentheses():
    # Sentences with parentheses
    text = "This is a sentence (with parentheses). Here is another."
    codeflash_output = sentence_count(text)  # 31.5μs -> 1.25μs (2430% faster)


def test_sentence_with_brackets_and_braces():
    # Sentences with brackets and braces
    text = "This is [a test]. {Another one}."
    codeflash_output = sentence_count(text)  # 32.4μs -> 1.19μs (2624% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
# function to test
# For testing, we need to define the sentence_count function and its dependencies.
# We'll use the real NLTK sent_tokenize for realistic behavior.

# imports
from unstructured.partition.text_type import sentence_count


# Dummy trace_logger for completeness (no-op)
class DummyLogger:
    def detail(self, msg):
        pass


trace_logger = DummyLogger()

# unit tests


class TestSentenceCount:
    # --- Basic Test Cases ---

    def test_empty_string(self):
        # Should return 0 for empty string
        codeflash_output = sentence_count("")  # 747ns -> 1.25μs (40.0% slower)

    def test_single_sentence(self):
        # Should return 1 for a simple sentence
        codeflash_output = sentence_count("This is a test.")  # 10.2μs -> 1.09μs (834% faster)

    def test_multiple_sentences(self):
        # Should return correct count for multiple sentences
        codeflash_output = sentence_count(
            "This is a test. Here is another sentence. And a third one!"
        )  # 51.5μs -> 1.38μs (3625% faster)

    def test_sentences_with_varied_punctuation(self):
        # Should handle sentences ending with ! and ?
        codeflash_output = sentence_count(
            "Is this working? Yes! It is."
        )  # 43.1μs -> 1.18μs (3552% faster)

    def test_sentences_with_abbreviations(self):
        # Should not split on abbreviations like "Dr.", "Mr.", "e.g."
        text = "Dr. Smith went to Washington. He arrived at 10 a.m. sharp."
        # NLTK correctly splits into 2 sentences
        codeflash_output = sentence_count(text)  # 4.49μs -> 1.24μs (261% faster)

    def test_sentences_with_newlines(self):
        # Should handle newlines between sentences
        text = "First sentence.\nSecond sentence!\n\nThird sentence?"
        codeflash_output = sentence_count(text)  # 4.22μs -> 1.08μs (289% faster)

    def test_min_length_parameter(self):
        # Only sentences with >= min_length words should be counted
        text = "Short. This one is long enough. Ok."
        # Only "This one is long enough" has >= 4 words
        codeflash_output = sentence_count(text, min_length=4)  # 49.1μs -> 10.5μs (366% faster)

    def test_min_length_zero(self):
        # min_length=0 should count all sentences
        text = "A. B. C."
        codeflash_output = sentence_count(text, min_length=0)  # 43.5μs -> 1.42μs (2954% faster)

    def test_min_length_none(self):
        # min_length=None should count all sentences
        text = "A. B. C."
        codeflash_output = sentence_count(text, min_length=None)  # 2.09μs -> 1.28μs (63.4% faster)

    # --- Edge Test Cases ---

    def test_only_punctuation(self):
        # Only punctuation, no words
        codeflash_output = sentence_count("...!!!???")  # 33.4μs -> 1.27μs (2525% faster)

    def test_sentence_with_only_spaces(self):
        # Spaces only should yield 0
        codeflash_output = sentence_count(" ")  # 5.67μs -> 862ns (557% faster)

    def test_sentence_with_emoji_and_symbols(self):
        # Emojis and symbols should not count as sentences
        codeflash_output = sentence_count("😀 😂 🤔")  # 8.09μs -> 1.16μs (598% faster)

    def test_sentence_with_mixed_unicode(self):
        # Should handle unicode characters and punctuation
        text = "Café. Voilà! Привет мир. こんにちは世界。"
        # NLTK may split Japanese as one sentence, Russian as one, etc.
        # Let's check for at least 3 sentences (English, French, Russian)
        codeflash_output = sentence_count(text)
        count = codeflash_output  # 71.8μs -> 1.34μs (5243% faster)

    def test_sentence_with_no_sentence_endings(self):
        # No sentence-ending punctuation, should be one sentence
        text = "This is a sentence without ending punctuation"
        codeflash_output = sentence_count(text)  # 8.12μs -> 1.07μs (659% faster)

    def test_sentence_with_ellipses(self):
        # Ellipses should not break sentences
        text = "Wait... what happened? I don't know..."
        codeflash_output = sentence_count(text)  # 3.83μs -> 1.17μs (227% faster)

    def test_sentence_with_multiple_spaces_and_tabs(self):
        # Should handle excessive whitespace correctly
        text = "Sentence one. \t Sentence two. \n\n Sentence three."
        codeflash_output = sentence_count(text)  # 43.0μs -> 1.12μs (3753% faster)

    def test_sentence_with_numbers_and_periods(self):
        # Numbers with periods should not split sentences
        text = "The value is 3.14. Next sentence."
        codeflash_output = sentence_count(text)  # 32.3μs -> 1.15μs (2714% faster)

    def test_sentence_with_bullet_points(self):
        # Should not count bullets as sentences
        text = "- Item one\n- Item two\n- Item three"
        codeflash_output = sentence_count(text)  # 7.78μs -> 1.01μs (666% faster)

    def test_sentence_with_long_word_and_min_length(self):
        # One long word (no spaces) with min_length > 1 should not count
        codeflash_output = sentence_count(
            "Supercalifragilisticexpialidocious.", min_length=2
        )  # 11.3μs -> 7.04μs (59.9% faster)

    def test_sentence_with_repeated_punctuation(self):
        # Should not split on repeated punctuation without sentence-ending
        text = "Hello!!! How are you??? Fine..."
        codeflash_output = sentence_count(text)  # 48.3μs -> 1.22μs (3867% faster)

    def test_sentence_with_internal_periods(self):
        # Internal periods (e.g., URLs) should not split sentences
        text = "Check out www.example.com. This is a new sentence."
        codeflash_output = sentence_count(text)  # 31.0μs -> 1.22μs (2439% faster)

    def test_sentence_with_parentheses_and_quotes(self):
        text = 'He said, "Hello there." (And then he left.)'
        # Should count as two sentences
        codeflash_output = sentence_count(text)  # 41.6μs -> 1.18μs (3430% faster)

    # --- Large Scale Test Cases ---

    def test_large_text_many_sentences(self):
        # Test with 500 sentences
        text = "This is a sentence. " * 500
        codeflash_output = sentence_count(text)  # 3.91ms -> 13.9μs (28106% faster)

    def test_large_text_with_min_length(self):
        # 1000 sentences, but only every other one is long enough
        text = ""
        for i in range(1000):
            if i % 2 == 0:
                text += "Short. "
            else:
                text += "This sentence is long enough for the test. "
        # Only 500 sentences should meet min_length=5
        codeflash_output = sentence_count(text, min_length=5)  # 8.33ms -> 1.08ms (671% faster)

    def test_large_text_no_sentence_endings(self):
        # One very long sentence without punctuation
        text = " ".join(["word"] * 1000)
        codeflash_output = sentence_count(text)  # 31.3μs -> 3.09μs (913% faster)

    def test_large_text_all_too_short(self):
        # 1000 one-word sentences, min_length=2, should return 0
        text = ". ".join(["A"] * 1000) + "."
        codeflash_output = sentence_count(text, min_length=2)  # 538μs -> 502μs (7.18% faster)

    def test_large_text_all_counted(self):
        # 1000 sentences, all long enough
        text = "This is a valid sentence. " * 1000
        codeflash_output = sentence_count(text, min_length=4)  # 8.46ms -> 1.12ms (655% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from unstructured.partition.text_type import sentence_count


def test_sentence_count():
    sentence_count("!", min_length=None)
```

</details>

<details>
<summary>⏪ Click to see Replay Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `test_benchmark6_py__replay_test_0.py::test_unstructured_partition_text_type_sentence_count` | 35.2μs | 20.5μs | 72.0%✅ |

</details>

<details>
<summary>🔎 Click to see Concolic Coverage Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-----------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `codeflash_concolic_jzsax6p2/tmpkbdw6p4k/test_concolic_coverage.py::test_sentence_count` | 10.8μs | 2.23μs | 385%✅ |

</details>

To edit these changes `git checkout codeflash/optimize-sentence_count-mjihf0yi` and push.

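
The speedup percentages in the tables and annotations appear to follow `(original / optimized - 1) × 100`, which is why 51.8 ms → 4.55 ms is reported as both 1,038% and 10.38x. The helper below is a hypothetical illustration of that arithmetic, not part of Codeflash; small differences from the reported figures are expected because the displayed timings are rounded.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Percent speedup of the optimized run; negative means it got slower.

    Both timings must be in the same unit (for example, both in microseconds).
    """
    return (original / optimized - 1.0) * 100.0


# Headline figure: 51.8 ms -> 4.55 ms
headline = speedup_percent(51.8, 4.55)

# A regression from the generated tests: 5.47 us -> 6.16 us
regression = speedup_percent(5.47, 6.16)

print(f"{headline:.0f}% faster, {abs(regression):.1f}% slower")
```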
[Codeflash](https://codeflash.ai) ![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>

References

#4160 - ⚡️ Speed up function `sentence_count` by 1,038%

Author

misrasaurabh1

Parents

ae0efca8
