enhancement: Speed up function `sentence_count` by 1,038% (#4160)
<!-- CODEFLASH_OPTIMIZATION:
{"function":"sentence_count","file":"unstructured/partition/text_type.py","speedup_pct":"1,038%","speedup_x":"10.38x","original_runtime":"51.8
milliseconds","best_runtime":"4.55
milliseconds","optimization_type":"loop","timestamp":"2025-12-23T11:08:46.623Z","version":"1.0"}
-->
#### 📄 1,038% (10.38x) speedup for ***`sentence_count` in
`unstructured/partition/text_type.py`***
⏱️ Runtime : **`51.8 milliseconds`** **→** **`4.55 milliseconds`** (best
of `14` runs)
#### 📝 Explanation and details
The optimized code achieves a **1037% speedup (51.8ms → 4.55ms)**
through two key optimizations:
## 1. **Caching Fix for `sent_tokenize` (Primary Speedup)**
**Problem**: The original code applied `@lru_cache` directly to
`sent_tokenize`, but NLTK's `_sent_tokenize` returns a `List[str]`,
which is **unhashable** and cannot be cached properly by Python's
`lru_cache`.
**Solution**: The optimized version introduces a two-layer approach:
- `_tokenize_for_cache()` - Cached function that returns `Tuple[str,
...]` (hashable)
- `sent_tokenize()` - Public wrapper that converts tuple to list
**Why it's faster**: This enables **actual caching** of tokenization
results. The test annotations show dramatic speedups (up to **35,000%
faster**) on repeated text, confirming the cache now works. Since
`sentence_count` tokenizes the same text patterns repeatedly across
function calls, this cache hit rate is crucial.
**Impact on hot paths**: Based on `function_references`, this function
is called from:
- `is_possible_narrative_text()` - checks if text contains ≥2 sentences
with `sentence_count(text, 3)`
- `is_possible_title()` - validates single-sentence constraint with
`sentence_count(text, min_length=...)`
- `exceeds_cap_ratio()` - checks sentence count to avoid multi-sentence
text
These are all text classification functions likely invoked repeatedly
during document parsing, making the caching fix highly impactful.
## 2. **Branch Prediction Optimization in `sentence_count`**
**Change**: Split the loop into two branches - one for `min_length`
case, one for no filtering:
```python
if min_length:
# Loop with filtering logic
else:
# Simple counting loop
```
**Why it's faster**:
- Eliminates repeated `if min_length:` checks inside the loop (7,181
checks in profiler)
- Allows CPU branch predictor to optimize each loop independently
- Hoists `trace_logger.detail` lookup outside loop (68 calls vs 3,046+
attribute lookups)
**Test results validation**:
- Cases **without** `min_length` show **massive speedups**
(3,000-35,000%) due to pure caching benefits
- Cases **with** `min_length` show **moderate speedups** (60-940%) since
filtering logic still executes, but benefits from reduced overhead and
hoisting
The optimization is most effective for workloads that process similar
text patterns repeatedly (common in document parsing pipelines) and
particularly when `min_length` is not specified, which appears to be the
common case based on function references.
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **60 Passed** |
| ⏪ Replay Tests | ✅ **5 Passed** |
| 🔎 Concolic Coverage Tests | ✅ **1 Passed** |
|📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_text_type.py::test_item_titles` | 47.2μs | 8.06μs |
486%✅ |
| `partition/test_text_type.py::test_sentence_count` | 4.34μs | 1.81μs |
139%✅ |
</details>
<details>
<summary>🌀 Click to see Generated Regression Tests</summary>
```python
# imports
from unstructured.partition.text_type import sentence_count
# Basic Test Cases
def test_single_sentence():
# Simple single sentence
text = "This is a test sentence."
codeflash_output = sentence_count(text) # 20.1μs -> 2.52μs (697% faster)
def test_multiple_sentences():
# Multiple sentences separated by periods
text = "This is the first sentence. This is the second sentence. Here is a third."
codeflash_output = sentence_count(text) # 62.7μs -> 1.58μs (3868% faster)
def test_sentences_with_various_punctuation():
# Sentences ending with different punctuation
text = "Is this a question? Yes! It is."
codeflash_output = sentence_count(text) # 44.1μs -> 1.48μs (2879% faster)
def test_sentence_with_min_length_none():
# min_length=None should count all sentences
text = "Short. Another one."
codeflash_output = sentence_count(text, min_length=None) # 27.0μs -> 1.59μs (1595% faster)
def test_sentence_with_min_length():
# Only sentences with at least min_length words are counted
text = "Short. This is a long enough sentence."
codeflash_output = sentence_count(text, min_length=4) # 33.2μs -> 13.5μs (146% faster)
def test_sentence_with_min_length_exact():
# Sentence with exactly min_length words should be counted
text = "One two three four."
codeflash_output = sentence_count(text, min_length=4) # 10.1μs -> 5.04μs (99.5% faster)
# Edge Test Cases
def test_empty_string():
# Empty string should return 0
codeflash_output = sentence_count("") # 5.30μs -> 1.04μs (409% faster)
def test_whitespace_only():
# String with only whitespace should return 0
codeflash_output = sentence_count(" ") # 5.26μs -> 888ns (493% faster)
def test_no_sentence_punctuation():
# Text with no sentence-ending punctuation is treated as one sentence by NLTK
text = "This is just a run on sentence with no punctuation"
codeflash_output = sentence_count(text) # 8.34μs -> 1.13μs (638% faster)
def test_sentence_with_only_punctuation():
# Sentences that are just punctuation should not be counted if min_length is set
text = "!!! ... ???"
codeflash_output = sentence_count(text, min_length=1) # 79.0μs -> 7.59μs (940% faster)
def test_sentence_with_non_ascii_punctuation():
# Sentences with Unicode punctuation
text = "This is a test sentence。This is another!"
# NLTK may not split these as sentences; check for at least 1
codeflash_output = sentence_count(text) # 10.9μs -> 1.13μs (871% faster)
def test_sentence_with_abbreviations():
# Abbreviations should not split sentences incorrectly
text = "Dr. Smith went to Washington. He arrived at 10 a.m. sharp."
codeflash_output = sentence_count(text) # 57.9μs -> 1.43μs (3959% faster)
def test_sentence_with_newlines():
# Sentences separated by newlines
text = "First sentence.\nSecond sentence!\n\nThird sentence?"
codeflash_output = sentence_count(text) # 43.2μs -> 1.34μs (3113% faster)
def test_sentence_with_multiple_spaces():
# Sentences with irregular spacing
text = "First sentence. Second sentence. "
codeflash_output = sentence_count(text) # 27.6μs -> 1.16μs (2282% faster)
def test_sentence_with_min_length_zero():
# min_length=0 should count all sentences
text = "A. B."
codeflash_output = sentence_count(text, min_length=0) # 27.7μs -> 1.38μs (1909% faster)
def test_sentence_with_min_length_greater_than_any_sentence():
# All sentences are too short for min_length
text = "A. B."
codeflash_output = sentence_count(text, min_length=10) # 5.47μs -> 6.16μs (11.2% slower)
def test_sentence_with_just_numbers():
# Sentences that are just numbers
text = "12345. 67890."
codeflash_output = sentence_count(text) # 31.7μs -> 1.29μs (2350% faster)
def test_sentence_with_only_punctuation_and_spaces():
# Only punctuation and spaces
text = " . . . "
codeflash_output = sentence_count(text) # 34.2μs -> 1.31μs (2502% faster)
def test_sentence_with_ellipsis():
# Ellipsis should not break sentence count
text = "Wait... what happened? I don't know..."
codeflash_output = sentence_count(text) # 44.7μs -> 1.36μs (3182% faster)
# Large Scale Test Cases
def test_large_number_of_sentences():
# 1000 short sentences
text = "Sentence. " * 1000
codeflash_output = sentence_count(text) # 8.26ms -> 23.5μs (35048% faster)
def test_large_text_with_long_sentences():
# 500 sentences, each with 10 words
sentence = "This is a sentence with exactly ten words."
text = " ".join([sentence for _ in range(500)])
codeflash_output = sentence_count(text) # 4.11ms -> 17.3μs (23651% faster)
def test_large_text_min_length_filtering():
# 1000 sentences, only half meet min_length
short_sentence = "Short."
long_sentence = "This is a sufficiently long sentence for testing."
text = " ".join([short_sentence, long_sentence] * 500)
codeflash_output = sentence_count(text, min_length=5) # 8.78ms -> 1.15ms (664% faster)
def test_large_text_all_filtered():
# All sentences filtered out by min_length
sentence = "A."
text = " ".join([sentence for _ in range(1000)])
codeflash_output = sentence_count(text, min_length=3) # 7.74ms -> 499μs (1450% faster)
# Regression/Mutation tests
def test_min_length_does_not_count_punctuation_as_word():
# Punctuation-only tokens should not be counted as words
text = "This . is . a . test."
# Each "is .", "a .", "test." is a sentence, but only the last is a real sentence
# NLTK will likely see this as one sentence
codeflash_output = sentence_count(text, min_length=2) # 52.5μs -> 7.96μs (560% faster)
def test_sentences_with_internal_periods():
# Internal periods (e.g., in abbreviations) do not split sentences
text = "This is Mr. Smith. He lives on St. Patrick's street."
codeflash_output = sentence_count(text) # 55.1μs -> 1.23μs (4371% faster)
def test_sentence_with_trailing_spaces_and_newlines():
# Sentences with trailing spaces and newlines
text = "First sentence. \nSecond sentence. \n"
codeflash_output = sentence_count(text) # 29.0μs -> 1.19μs (2337% faster)
def test_sentence_with_tabs():
# Sentences separated by tabs
text = "First sentence.\tSecond sentence."
codeflash_output = sentence_count(text) # 30.1μs -> 1.10μs (2645% faster)
def test_sentence_with_multiple_types_of_whitespace():
# Sentences separated by various whitespace
text = "First sentence.\n\t Second sentence.\r\nThird sentence."
codeflash_output = sentence_count(text) # 45.0μs -> 1.30μs (3373% faster)
def test_sentence_with_unicode_whitespace():
# Sentences separated by Unicode whitespace
text = "First sentence.\u2003Second sentence.\u2029Third sentence."
codeflash_output = sentence_count(text) # 47.4μs -> 1.24μs (3714% faster)
def test_sentence_with_emojis():
# Sentences containing emojis
text = "Hello world! 😀 How are you? 👍"
codeflash_output = sentence_count(text) # 47.4μs -> 1.16μs (3989% faster)
def test_sentence_with_quotes():
# Sentences with quoted text
text = "\"Hello,\" she said. 'How are you?'"
codeflash_output = sentence_count(text) # 41.7μs -> 1.07μs (3812% faster)
def test_sentence_with_parentheses():
# Sentences with parentheses
text = "This is a sentence (with parentheses). Here is another."
codeflash_output = sentence_count(text) # 31.5μs -> 1.25μs (2430% faster)
def test_sentence_with_brackets_and_braces():
# Sentences with brackets and braces
text = "This is [a test]. {Another one}."
codeflash_output = sentence_count(text) # 32.4μs -> 1.19μs (2624% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
# function to test
# For testing, we need to define the sentence_count function and its dependencies.
# We'll use the real NLTK sent_tokenize for realistic behavior.
# imports
from unstructured.partition.text_type import sentence_count
# Dummy trace_logger for completeness (no-op)
class DummyLogger:
def detail(self, msg):
pass
trace_logger = DummyLogger()
# unit tests
class TestSentenceCount:
# --- Basic Test Cases ---
def test_empty_string(self):
# Should return 0 for empty string
codeflash_output = sentence_count("") # 747ns -> 1.25μs (40.0% slower)
def test_single_sentence(self):
# Should return 1 for a simple sentence
codeflash_output = sentence_count("This is a test.") # 10.2μs -> 1.09μs (834% faster)
def test_multiple_sentences(self):
# Should return correct count for multiple sentences
codeflash_output = sentence_count(
"This is a test. Here is another sentence. And a third one!"
) # 51.5μs -> 1.38μs (3625% faster)
def test_sentences_with_varied_punctuation(self):
# Should handle sentences ending with ! and ?
codeflash_output = sentence_count(
"Is this working? Yes! It is."
) # 43.1μs -> 1.18μs (3552% faster)
def test_sentences_with_abbreviations(self):
# Should not split on abbreviations like "Dr.", "Mr.", "e.g."
text = "Dr. Smith went to Washington. He arrived at 10 a.m. sharp."
# NLTK correctly splits into 2 sentences
codeflash_output = sentence_count(text) # 4.49μs -> 1.24μs (261% faster)
def test_sentences_with_newlines(self):
# Should handle newlines between sentences
text = "First sentence.\nSecond sentence!\n\nThird sentence?"
codeflash_output = sentence_count(text) # 4.22μs -> 1.08μs (289% faster)
def test_min_length_parameter(self):
# Only sentences with >= min_length words should be counted
text = "Short. This one is long enough. Ok."
# Only "This one is long enough" has >= 4 words
codeflash_output = sentence_count(text, min_length=4) # 49.1μs -> 10.5μs (366% faster)
def test_min_length_zero(self):
# min_length=0 should count all sentences
text = "A. B. C."
codeflash_output = sentence_count(text, min_length=0) # 43.5μs -> 1.42μs (2954% faster)
def test_min_length_none(self):
# min_length=None should count all sentences
text = "A. B. C."
codeflash_output = sentence_count(text, min_length=None) # 2.09μs -> 1.28μs (63.4% faster)
# --- Edge Test Cases ---
def test_only_punctuation(self):
# Only punctuation, no words
codeflash_output = sentence_count("...!!!???") # 33.4μs -> 1.27μs (2525% faster)
def test_sentence_with_only_spaces(self):
# Spaces only should yield 0
codeflash_output = sentence_count(" ") # 5.67μs -> 862ns (557% faster)
def test_sentence_with_emoji_and_symbols(self):
# Emojis and symbols should not count as sentences
codeflash_output = sentence_count("😀 😂 🤔") # 8.09μs -> 1.16μs (598% faster)
def test_sentence_with_mixed_unicode(self):
# Should handle unicode characters and punctuation
text = "Café. Voilà! Привет мир. こんにちは世界。"
# NLTK may split Japanese as one sentence, Russian as one, etc.
# Let's check for at least 3 sentences (English, French, Russian)
codeflash_output = sentence_count(text)
count = codeflash_output # 71.8μs -> 1.34μs (5243% faster)
def test_sentence_with_no_sentence_endings(self):
# No sentence-ending punctuation, should be one sentence
text = "This is a sentence without ending punctuation"
codeflash_output = sentence_count(text) # 8.12μs -> 1.07μs (659% faster)
def test_sentence_with_ellipses(self):
# Ellipses should not break sentences
text = "Wait... what happened? I don't know..."
codeflash_output = sentence_count(text) # 3.83μs -> 1.17μs (227% faster)
def test_sentence_with_multiple_spaces_and_tabs(self):
# Should handle excessive whitespace correctly
text = "Sentence one. \t Sentence two. \n\n Sentence three."
codeflash_output = sentence_count(text) # 43.0μs -> 1.12μs (3753% faster)
def test_sentence_with_numbers_and_periods(self):
# Numbers with periods should not split sentences
text = "The value is 3.14. Next sentence."
codeflash_output = sentence_count(text) # 32.3μs -> 1.15μs (2714% faster)
def test_sentence_with_bullet_points(self):
# Should not count bullets as sentences
text = "- Item one\n- Item two\n- Item three"
codeflash_output = sentence_count(text) # 7.78μs -> 1.01μs (666% faster)
def test_sentence_with_long_word_and_min_length(self):
# One long word (no spaces) with min_length > 1 should not count
codeflash_output = sentence_count(
"Supercalifragilisticexpialidocious.", min_length=2
) # 11.3μs -> 7.04μs (59.9% faster)
def test_sentence_with_repeated_punctuation(self):
# Should not split on repeated punctuation without sentence-ending
text = "Hello!!! How are you??? Fine..."
codeflash_output = sentence_count(text) # 48.3μs -> 1.22μs (3867% faster)
def test_sentence_with_internal_periods(self):
# Internal periods (e.g., URLs) should not split sentences
text = "Check out www.example.com. This is a new sentence."
codeflash_output = sentence_count(text) # 31.0μs -> 1.22μs (2439% faster)
def test_sentence_with_parentheses_and_quotes(self):
text = 'He said, "Hello there." (And then he left.)'
# Should count as two sentences
codeflash_output = sentence_count(text) # 41.6μs -> 1.18μs (3430% faster)
# --- Large Scale Test Cases ---
def test_large_text_many_sentences(self):
# Test with 500 sentences
text = "This is a sentence. " * 500
codeflash_output = sentence_count(text) # 3.91ms -> 13.9μs (28106% faster)
def test_large_text_with_min_length(self):
# 1000 sentences, but only every other one is long enough
text = ""
for i in range(1000):
if i % 2 == 0:
text += "Short. "
else:
text += "This sentence is long enough for the test. "
# Only 500 sentences should meet min_length=5
codeflash_output = sentence_count(text, min_length=5) # 8.33ms -> 1.08ms (671% faster)
def test_large_text_no_sentence_endings(self):
# One very long sentence without punctuation
text = " ".join(["word"] * 1000)
codeflash_output = sentence_count(text) # 31.3μs -> 3.09μs (913% faster)
def test_large_text_all_too_short(self):
# 1000 one-word sentences, min_length=2, should return 0
text = ". ".join(["A"] * 1000) + "."
codeflash_output = sentence_count(text, min_length=2) # 538μs -> 502μs (7.18% faster)
def test_large_text_all_counted(self):
# 1000 sentences, all long enough
text = "This is a valid sentence. " * 1000
codeflash_output = sentence_count(text, min_length=4) # 8.46ms -> 1.12ms (655% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
from unstructured.partition.text_type import sentence_count
def test_sentence_count():
sentence_count("!", min_length=None)
```
</details>
<details>
<summary>⏪ Click to see Replay Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
|
`test_benchmark6_py__replay_test_0.py::test_unstructured_partition_text_type_sentence_count`
| 35.2μs | 20.5μs | 72.0%✅ |
</details>
<details>
<summary>🔎 Click to see Concolic Coverage Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-----------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
|
`codeflash_concolic_jzsax6p2/tmpkbdw6p4k/test_concolic_coverage.py::test_sentence_count`
| 10.8μs | 2.23μs | 385%✅ |
</details>
To edit these changes `git checkout
codeflash/optimize-sentence_count-mjihf0yi` and push.
[](https://codeflash.ai)

---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>