enhancement: Speed up function `contains_verb` by 8% (#4161)
<!-- CODEFLASH_OPTIMIZATION:
{"function":"contains_verb","file":"unstructured/partition/text_type.py","speedup_pct":"8%","speedup_x":"0.08x","original_runtime":"890 milliseconds","best_runtime":"827 milliseconds","optimization_type":"loop","timestamp":"2025-12-23T16:34:05.083Z","version":"1.0"}
-->
#### 📄 8% (0.08x) speedup for ***`contains_verb` in `unstructured/partition/text_type.py`***
⏱️ Runtime : **`890 milliseconds`** **→** **`827 milliseconds`** (best
of `7` runs)
#### 📝 Explanation and details
The optimization achieves an **8% speedup** (890 ms → 827 ms) by
replacing the sentence-by-sentence loop over NLTK's `pos_tag` with a
single batched call to `pos_tag_sents`.
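Depending on the baseline, the measured runtimes give either about 7% or about 8%. The report does not state which formula it uses, so the sketch below simply computes both; the function names are hypothetical and not part of Codeflash or `unstructured`.

```python
def speedup_pct(original_ms: float, optimized_ms: float) -> float:
    """Time saved relative to the optimized runtime, in percent."""
    return (original_ms - optimized_ms) / optimized_ms * 100


def runtime_reduction_pct(original_ms: float, optimized_ms: float) -> float:
    """Time saved relative to the original runtime, in percent."""
    return (original_ms - optimized_ms) / original_ms * 100


print(round(speedup_pct(890, 827), 1))            # 7.6
print(round(runtime_reduction_pct(890, 827), 1))  # 7.1
```

Rounded to whole percentages, these are 8% and 7% respectively.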
**What Changed:**
- **Batch POS tagging**: Instead of calling `_pos_tag()` individually
for each sentence in a loop, the code now tokenizes all sentences first,
then passes them together to `_pos_tag_sents()`. This single batched
call processes all sentences at once.
- **List comprehension for flattening**: The nested loop that extended
`parts_of_speech` is replaced with a list comprehension that flattens
the result from `_pos_tag_sents()`.
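The two changes above can be sketched as follows. This is an illustration of the pattern, not the code from the PR: the helper names are hypothetical, and the taggers are passed in as callables with stand-in implementations so the snippet runs without NLTK models installed.

```python
from typing import Callable, List, Sequence, Tuple

Tagged = Tuple[str, str]


def tag_sequentially(
    sentences: Sequence[Sequence[str]],
    tag_one: Callable[[Sequence[str]], List[Tagged]],
) -> List[Tagged]:
    """Before: one tagger call per tokenized sentence."""
    parts_of_speech: List[Tagged] = []
    for tokens in sentences:
        parts_of_speech.extend(tag_one(tokens))
    return parts_of_speech


def tag_in_batch(
    sentences: Sequence[Sequence[str]],
    tag_many: Callable[[Sequence[Sequence[str]]], List[List[Tagged]]],
) -> List[Tagged]:
    """After: one batched call, then flatten with a list comprehension."""
    tagged_sentences = tag_many(sentences)
    return [pair for sentence in tagged_sentences for pair in sentence]


# Stand-in taggers (hypothetical) so the sketch runs without NLTK data.
def fake_tag_one(tokens: Sequence[str]) -> List[Tagged]:
    return [(tok, "VBZ" if tok == "runs" else "NN") for tok in tokens]


def fake_tag_many(sentences: Sequence[Sequence[str]]) -> List[List[Tagged]]:
    return [fake_tag_one(tokens) for tokens in sentences]


sentences = [["the", "cat", "runs"], ["blue", "sky"]]
# Both strategies must produce the same flat list of (token, tag) pairs.
assert tag_sequentially(sentences, fake_tag_one) == tag_in_batch(sentences, fake_tag_many)
```

With NLTK installed, `nltk.pos_tag` would fill the `tag_one` role and `nltk.pos_tag_sents` the `tag_many` role.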
**Why It's Faster:**
NLTK's `pos_tag()` incurs setup overhead (model loading, context
initialization) on each invocation, so tagging N sentences one at a time
pays that cost N times. By contrast, `pos_tag_sents()` performs the setup
once and tags all sentences in a single batch, so the setup cost no
longer grows with the number of sentences. The per-token tagging work
itself is unchanged, which is why the gain is modest and is most visible
for texts with many sentences.
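A toy model of that cost argument, assuming a fixed setup step per tagger call. `CountingTagger` is a hypothetical class written for this illustration and does not reflect NLTK's internals.

```python
class CountingTagger:
    """Toy tagger that counts how often its simulated setup step runs."""

    def __init__(self) -> None:
        self.setups = 0

    def _setup(self) -> None:
        # Stands in for whatever fixed cost a real tagger pays per call.
        self.setups += 1

    def tag(self, tokens):
        self._setup()
        return [(tok, "NN") for tok in tokens]

    def tag_sents(self, sentences):
        self._setup()
        return [[(tok, "NN") for tok in tokens] for tokens in sentences]


sentences = [["a", "b"], ["c"], ["d", "e", "f"]]

loop_tagger = CountingTagger()
for tokens in sentences:
    loop_tagger.tag(tokens)  # setup runs once per sentence

batch_tagger = CountingTagger()
batch_tagger.tag_sents(sentences)  # setup runs once in total

print(loop_tagger.setups, batch_tagger.setups)  # 3 1
```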
**Impact Based on Context:**
The `contains_verb()` function is called from
`is_possible_narrative_text()`, which appears to be in a document
classification/partitioning pipeline. Given that this function checks
for narrative text characteristics, it likely runs on many text segments
during document processing. The optimization provides:
- **Up to ~9% speedup** for large-scale tests with many sentences (e.g.,
200+ repeated sentences)
- **Roughly 2-7% speedup** for short multi-sentence inputs
- **Slight slowdown (about 3-7%)** on empty or whitespace-only inputs,
which finish in under 7μs, due to the overhead of creating intermediate
lists; these cases are typically cached via `@lru_cache`
The batch processing particularly benefits workloads where
`is_possible_narrative_text()` processes longer text segments with
multiple sentences, which is common in document partitioning tasks.
Since the function is cached, the optimization's impact is most
significant on cache misses with multi-sentence text.
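The interaction with caching can be illustrated with a minimal sketch. `contains_verb_cached` is a hypothetical stand-in with a trivial body, not the real `contains_verb`; it only shows that a function wrapped in `functools.lru_cache` executes its body on cache misses, so any speedup inside the body applies only to those calls.

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=None)
def contains_verb_cached(text: str) -> bool:
    """Hypothetical stand-in that counts how often the body actually runs."""
    global calls
    calls += 1
    return "runs" in text.lower().split()


for _ in range(3):
    contains_verb_cached("The cat runs")  # 1 miss, then 2 hits
contains_verb_cached("The blue sky")      # 1 more miss

info = contains_verb_cached.cache_info()
print(calls, info.hits, info.misses)  # 2 2 2
```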
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **23 Passed** |
| 🌀 Generated Regression Tests | ✅ **108 Passed** |
| ⏪ Replay Tests | ✅ **8 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_text_type.py::test_contains_verb` | 435μs | 438μs | -0.586%⚠️ |
</details>
<details>
<summary>🌀 Click to see Generated Regression Tests</summary>
```python
from __future__ import annotations
from typing import Final, List
# imports
from unstructured.partition.text_type import contains_verb
POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]
# ---- UNIT TESTS ----
# Basic Test Cases
def test_simple_sentence_with_verb():
    # Checks a simple sentence with an obvious verb
    codeflash_output = contains_verb("The cat runs.") # 203μs -> 193μs (5.46% faster)
def test_simple_sentence_without_verb():
    # Checks a sentence with no verb
    codeflash_output = contains_verb("The blue sky.") # 130μs -> 124μs (5.04% faster)
def test_question_with_verb():
    # Checks a question containing a verb
    codeflash_output = contains_verb("Is this your book?") # 95.0μs -> 92.5μs (2.73% faster)
def test_sentence_with_multiple_verbs():
    # Checks a sentence containing more than one verb
    codeflash_output = contains_verb("He jumped and ran.") # 140μs -> 132μs (6.12% faster)
def test_sentence_with_verb_in_past_tense():
    # Checks a sentence with a past tense verb
    codeflash_output = contains_verb("She walked home.") # 132μs -> 121μs (8.76% faster)
def test_sentence_with_verb_in_present_participle():
    # Checks a sentence with a present participle verb
    codeflash_output = contains_verb("The dog is barking.") # 130μs -> 124μs (4.97% faster)
def test_sentence_with_verb_in_past_participle():
    # Checks a sentence with a past participle verb
    codeflash_output = contains_verb("The cake was eaten.") # 125μs -> 121μs (4.06% faster)
def test_sentence_with_modal_verb():
    # Checks a sentence with a modal verb ("can" is not in POS_VERB_TAGS, but "run" is)
    codeflash_output = contains_verb("He can run.") # 84.0μs -> 81.7μs (2.83% faster)
def test_sentence_with_no_alphabetic_characters():
    # Checks a string with only punctuation
    codeflash_output = contains_verb("!!!") # 97.1μs -> 95.7μs (1.44% faster)
def test_sentence_with_numbers_only():
    # Checks a string with only numbers
    codeflash_output = contains_verb("1234567890") # 87.6μs -> 82.4μs (6.32% faster)
# Edge Test Cases
def test_empty_string():
    # Checks empty input string
    codeflash_output = contains_verb("") # 6.38μs -> 6.66μs (4.21% slower)
def test_whitespace_only():
    # Checks string with only whitespace
    codeflash_output = contains_verb(" ") # 6.30μs -> 6.78μs (7.15% slower)
def test_uppercase_sentence_with_verb():
    # Checks that all-uppercase input is lowercased and verbs are detected
    codeflash_output = contains_verb("THE DOG BARKED.") # 131μs -> 122μs (7.51% faster)
def test_uppercase_sentence_without_verb():
    # Checks that all-uppercase input with no verb returns False
    codeflash_output = contains_verb("THE BLUE SKY.") # 123μs -> 116μs (5.93% faster)
def test_sentence_with_non_ascii_characters_and_verb():
    # Checks sentence with accented characters and a verb
    codeflash_output = contains_verb("Él corre rápido.") # 144μs -> 145μs (0.863% slower)
def test_sentence_with_verb_as_ambiguous_word():
    # "Run" as a noun
    codeflash_output = contains_verb("He went for a run.") # 88.4μs -> 87.2μs (1.38% faster)
def test_sentence_with_verb_as_ambiguous_word_verb_usage():
    # "Run" as a verb
    codeflash_output = contains_verb("He will run tomorrow.") # 88.9μs -> 86.9μs (2.35% faster)
def test_sentence_with_abbreviation():
    # Checks sentence with abbreviation and verb
    codeflash_output = contains_verb("Dr. Smith arrived.") # 136μs -> 132μs (3.40% faster)
def test_sentence_with_newlines_and_tab_characters():
    # Checks sentence with newlines and tabs
    codeflash_output = contains_verb(
        "The dog\nbarked.\tThe cat slept."
    ) # 236μs -> 220μs (7.22% faster)
def test_sentence_with_only_stopwords():
    # Checks sentence with only stopwords (no verbs)
    codeflash_output = contains_verb("and the but or") # 34.5μs -> 33.4μs (3.27% faster)
def test_sentence_with_conjunctions_and_verb():
    # Checks sentence with conjunctions and a verb
    codeflash_output = contains_verb("And then he laughed.") # 92.7μs -> 97.1μs (4.55% slower)
def test_sentence_with_special_characters_and_verb():
    # Checks sentence with special characters and a verb
    codeflash_output = contains_verb("@user replied!") # 163μs -> 153μs (6.70% faster)
def test_sentence_with_url_and_verb():
    # Checks sentence with a URL and a verb
    codeflash_output = contains_verb(
        "Check https://example.com and see."
    ) # 217μs -> 206μs (5.12% faster)
def test_sentence_with_emoji_and_verb():
    # Checks sentence with emoji and a verb
    codeflash_output = contains_verb("She runs fast 🏃♀️.") # 178μs -> 167μs (6.75% faster)
def test_sentence_with_unicode_and_no_verb():
    # Checks sentence with unicode and no verb
    codeflash_output = contains_verb("🍎🍏🍐") # 72.7μs -> 70.9μs (2.50% faster)
def test_sentence_with_single_verb_only():
    # Checks a sentence that is just a verb
    codeflash_output = contains_verb("Run") # 76.4μs -> 73.1μs (4.46% faster)
def test_sentence_with_single_noun_only():
    # Checks a sentence that is just a noun
    codeflash_output = contains_verb("Tree") # 78.7μs -> 73.9μs (6.45% faster)
def test_sentence_with_verb_in_quotes():
    # Checks a verb inside quotes
    codeflash_output = contains_verb('"Run" is a verb.') # 149μs -> 138μs (7.65% faster)
def test_sentence_with_parentheses_and_verb():
    # Checks a verb inside parentheses
    codeflash_output = contains_verb("He (runs) every day.") # 92.4μs -> 89.8μs (2.91% faster)
def test_sentence_with_dash_and_verb():
    # Checks a sentence with a dash and a verb
    codeflash_output = contains_verb("He - runs.") # 80.6μs -> 81.4μs (1.02% slower)
def test_sentence_with_multiple_sentences_and_one_verb():
    # Checks multiple sentences, only one has a verb
    codeflash_output = contains_verb("The blue sky. The cat runs.") # 252μs -> 248μs (1.88% faster)
def test_sentence_with_multiple_sentences_no_verbs():
    # Checks multiple sentences, none have verbs
    codeflash_output = contains_verb("The blue sky. The red car.") # 199μs -> 195μs (1.93% faster)
def test_sentence_with_number_and_verb():
    # Checks sentence with number and verb
    codeflash_output = contains_verb("There are 5 cats.") # 88.4μs -> 86.2μs (2.54% faster)
def test_sentence_with_number_and_no_verb():
    # Checks sentence with number and no verb
    codeflash_output = contains_verb("5 cats.") # 76.5μs -> 74.9μs (2.11% faster)
def test_sentence_with_plural_noun_no_verb():
    # Checks plural noun with no verb
    codeflash_output = contains_verb("Cats.") # 77.7μs -> 74.4μs (4.52% faster)
def test_sentence_with_verb_and_compound_noun():
    # Checks sentence with compound noun and verb
    codeflash_output = contains_verb("The ice-cream melts.") # 130μs -> 130μs (0.354% faster)
# Large Scale Test Cases
def test_large_text_with_many_verbs():
    # Checks a long text with many verbs
    text = " ".join(["The dog runs. The cat jumps. The bird flies." for _ in range(200)])
    codeflash_output = contains_verb(text) # 51.3ms -> 47.0ms (9.18% faster)
def test_large_text_with_no_verbs():
    # Checks a long text with no verbs
    text = " ".join(["The blue sky. The red car. The green grass." for _ in range(200)])
    codeflash_output = contains_verb(text) # 46.4ms -> 42.5ms (9.19% faster)
def test_large_text_with_verbs_in_middle():
    # Checks a long text with verbs only in the middle
    text = (
        " ".join(["The blue sky." for _ in range(100)])
        + " The cat ran. "
        + " ".join(["The green grass." for _ in range(100)])
    )
    codeflash_output = contains_verb(text) # 17.0ms -> 16.1ms (5.72% faster)
def test_large_text_with_uppercase_and_verbs():
    # Checks a long uppercase text with verbs
    text = " ".join(["THE DOG RAN. THE CAT JUMPED. THE BIRD FLEW." for _ in range(200)])
    codeflash_output = contains_verb(text) # 51.6ms -> 47.1ms (9.56% faster)
def test_large_text_with_mixed_case_and_verbs():
    # Checks a long text with mixed case and verbs
    text = "The dog ran. " * 500 + "the cat slept. " * 500
    codeflash_output = contains_verb(text) # 83.5ms -> 77.5ms (7.64% faster)
def test_large_text_with_numbers_and_no_verbs():
    # Checks a long text with only numbers and no verbs
    text = "1234567890 " * 1000
    codeflash_output = contains_verb(text) # 32.3ms -> 31.0ms (4.08% faster)
def test_large_text_with_emojis_and_no_verbs():
    # Checks a long text with only emojis and no verbs
    text = "😀😃😄😁😆😅😂🤣☺️😊 " * 100
    codeflash_output = contains_verb(text) # 2.24ms -> 2.20ms (1.97% faster)
def test_large_text_with_verbs_and_special_characters():
    # Checks a long text with verbs and special characters
    text = "He runs! @user replied. #hashtag " * 300
    codeflash_output = contains_verb(text) # 57.6ms -> 52.8ms (9.10% faster)
def test_large_text_all_uppercase_no_verbs():
    # Checks a long uppercase text with no verbs
    text = ("THE BLUE SKY. THE RED CAR. " * 400).strip()
    codeflash_output = contains_verb(text) # 55.7ms -> 52.2ms (6.80% faster)
def test_large_text_with_sentences_and_newlines():
    # Checks a long text with newlines and verbs
    text = "\n".join(["The dog barked." for _ in range(300)])
    codeflash_output = contains_verb(text) # 26.0ms -> 24.0ms (8.08% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
import pytest # used for our unit tests
from unstructured.partition.text_type import contains_verb
# function to test
# (Assume the code for pos_tag and contains_verb is as given in the prompt.)
# --- Basic Test Cases ---
def test_contains_verb_simple_sentence():
    # Basic sentence with a single verb
    codeflash_output = contains_verb("The cat sleeps.") # 153μs -> 169μs (8.96% slower)
def test_contains_verb_multiple_verbs():
    # Sentence with multiple verbs
    codeflash_output = contains_verb(
        "She runs and jumps every morning."
    ) # 144μs -> 140μs (2.87% faster)
def test_contains_verb_no_verb():
    # Sentence with no verbs
    codeflash_output = contains_verb("The blue sky.") # 128μs -> 123μs (4.15% faster)
def test_contains_verb_question():
    # Question form with a verb
    codeflash_output = contains_verb("Is this your book?") # 98.0μs -> 94.5μs (3.77% faster)
def test_contains_verb_negative_sentence():
    # Sentence with negation
    codeflash_output = contains_verb("He does not like apples.") # 142μs -> 142μs (0.153% slower)
def test_contains_verb_verb_ing():
    # Sentence with present participle verb
    codeflash_output = contains_verb("Running is fun.") # 136μs -> 127μs (7.00% faster)
def test_contains_verb_past_tense():
    # Sentence with past tense verb
    codeflash_output = contains_verb("He walked home.") # 133μs -> 125μs (6.28% faster)
def test_contains_verb_passive_voice():
    # Passive voice sentence
    codeflash_output = contains_verb("The cake was eaten.") # 129μs -> 124μs (3.86% faster)
def test_contains_verb_uppercase_text():
    # Text in uppercase, should be normalized
    codeflash_output = contains_verb("THE DOG BARKED.") # 120μs -> 111μs (8.03% faster)
def test_contains_verb_mixed_case_text():
    # Mixed case, should work
    codeflash_output = contains_verb("tHe CaT SlePt.") # 151μs -> 147μs (3.01% faster)
# --- Edge Test Cases ---
def test_contains_verb_empty_string():
    # Empty string input
    codeflash_output = contains_verb("") # 6.85μs -> 7.21μs (4.95% slower)
def test_contains_verb_whitespace_only():
    # String with only whitespace
    codeflash_output = contains_verb(" ") # 6.69μs -> 6.93μs (3.43% slower)
def test_contains_verb_non_english():
    # Non-English text (should return False as no English verbs)
    codeflash_output = contains_verb("これは日本語の文です。") # 91.3μs -> 88.4μs (3.33% faster)
def test_contains_verb_numbers_and_symbols():
    # String with only numbers and symbols
    codeflash_output = contains_verb("12345 !@#$%") # 177μs -> 180μs (1.75% slower)
def test_contains_verb_one_word_noun():
    # Single noun word
    codeflash_output = contains_verb("Table") # 78.6μs -> 72.2μs (8.81% faster)
def test_contains_verb_one_word_verb():
    # Single verb word
    codeflash_output = contains_verb("Run") # 74.7μs -> 73.2μs (2.02% faster)
def test_contains_verb_command():
    # Imperative/command sentence
    codeflash_output = contains_verb("Sit!") # 73.2μs -> 76.4μs (4.14% slower)
def test_contains_verb_sentence_with_url():
    # Sentence containing a URL
    codeflash_output = contains_verb(
        "Visit https://example.com for more info."
    ) # 254μs -> 244μs (4.09% faster)
def test_contains_verb_sentence_with_abbreviation():
    # Sentence containing abbreviations
    codeflash_output = contains_verb("Dr. Smith arrived.") # 129μs -> 129μs (0.051% slower)
def test_contains_verb_sentence_with_apostrophe():
    # Sentence with contractions
    codeflash_output = contains_verb("He can't go.") # 93.0μs -> 91.8μs (1.22% faster)
def test_contains_verb_sentence_with_quotes():
    # Sentence with quoted verb
    codeflash_output = contains_verb('He said, "Run!"') # 134μs -> 132μs (2.13% faster)
def test_contains_verb_sentence_with_parentheses():
    # Sentence with verb inside parentheses
    codeflash_output = contains_verb("The dog (barked) loudly.") # 159μs -> 166μs (4.20% slower)
def test_contains_verb_sentence_with_no_alpha():
    # String with no alphabetic characters
    codeflash_output = contains_verb("1234567890") # 75.7μs -> 75.5μs (0.327% faster)
def test_contains_verb_sentence_with_newlines():
    # Sentence with newlines
    codeflash_output = contains_verb("The dog\nbarked.") # 120μs -> 109μs (9.95% faster)
def test_contains_verb_sentence_with_tabs():
    # Sentence with tabs
    codeflash_output = contains_verb("The\tdog\tbarked.") # 114μs -> 104μs (9.09% faster)
def test_contains_verb_sentence_with_multiple_sentences():
    # Multiple sentences, at least one with a verb
    codeflash_output = contains_verb(
        "The sky. The dog barked. The tree."
    ) # 276μs -> 260μs (5.88% faster)
def test_contains_verb_sentence_with_multiple_sentences_no_verbs():
    # Multiple sentences, none with verbs
    codeflash_output = contains_verb(
        "The sky. The tree. The mountain."
    ) # 229μs -> 220μs (4.43% faster)
def test_contains_verb_sentence_with_hyphenated_words():
    # Sentence with hyphenated words and a verb
    codeflash_output = contains_verb(
        "The well-known actor performed."
    ) # 163μs -> 165μs (0.896% slower)
def test_contains_verb_sentence_with_non_ascii_chars():
    # Sentence with accented characters and a verb
    codeflash_output = contains_verb("José runs every day.") # 124μs -> 123μs (1.38% faster)
def test_contains_verb_sentence_with_emojis():
    # Sentence with emojis and a verb
    codeflash_output = contains_verb("He runs 🏃♂️ every day.") # 126μs -> 127μs (1.02% slower)
def test_contains_verb_sentence_with_verb_as_noun():
    # Word that can be both noun and verb, used as noun
    codeflash_output = contains_verb("The run was long.") # 127μs -> 135μs (6.02% slower)
def test_contains_verb_sentence_with_verb_as_noun_and_verb():
    # Word that can be both noun and verb, used as verb
    codeflash_output = contains_verb("They run every day.") # 83.9μs -> 76.5μs (9.70% faster)
# --- Large Scale Test Cases ---
def test_contains_verb_large_text_with_verbs():
    # Large text (1000 short sentences) with verbs scattered throughout
    text = " ".join(["He runs."] * 500 + ["The cat sleeps."] * 500)
    codeflash_output = contains_verb(text) # 68.4ms -> 62.7ms (9.04% faster)
def test_contains_verb_large_text_no_verbs():
    # Large text (1000 short sentences) with no verbs
    text = " ".join(["The mountain."] * 1000)
    codeflash_output = contains_verb(text) # 57.4ms -> 53.2ms (7.83% faster)
def test_contains_verb_large_text_mixed():
    # Large text with verbs only in the last sentence
    text = " ".join(["The mountain."] * 999 + ["He runs."])
    codeflash_output = contains_verb(text) # 57.8ms -> 53.1ms (8.73% faster)
def test_contains_verb_large_text_all_uppercase():
    # Large uppercase text with verbs, should normalize
    text = " ".join(["THE DOG BARKED."] * 1000)
    codeflash_output = contains_verb(text) # 85.5ms -> 78.6ms (8.74% faster)
def test_contains_verb_large_text_with_newlines():
    # Large text with newlines separating sentences
    text = "\n".join(["He runs."] * 1000)
    codeflash_output = contains_verb(text) # 53.3ms -> 49.7ms (7.36% faster)
def test_contains_verb_large_text_with_numbers_and_symbols():
    # Large text with numbers, symbols, and a single verb sentence
    text = "12345 !@#$% " * 999 + "He runs."
    codeflash_output = contains_verb(text) # 78.4ms -> 73.0ms (7.37% faster)
def test_contains_verb_large_text_all_nouns():
    # Large text with only nouns
    text = " ".join(["Table"] * 1000)
    codeflash_output = contains_verb(text) # 27.4ms -> 27.0ms (1.51% faster)
def test_contains_verb_large_text_all_verbs():
    # Large text with only verbs
    text = " ".join(["Run"] * 1000)
    codeflash_output = contains_verb(text) # 25.5ms -> 24.8ms (2.85% faster)
# --- Mutation Testing Cases (to catch subtle bugs) ---
@pytest.mark.parametrize(
    "text,expected",
    [
        ("run", True), # verb, lower case
        ("RUN", True), # verb, upper case
        ("Running", True), # verb, gerund
        ("RAN", True), # verb, past tense
        ("", False), # empty
        (" ", False), # whitespace
        ("Table", False), # noun
        ("Table run", True), # noun and verb
        ("The", False), # article
        ("quickly", False), # adverb
        ("quickly run", True), # adverb + verb
        ("run quickly", True), # verb + adverb
        ("He", False), # pronoun
        ("He runs", True), # pronoun + verb
        ("He run", True), # pronoun + verb (incorrect grammar but verb present)
        ("He is", True), # verb 'is'
        ("He was", True), # verb 'was'
        ("He be", True), # verb 'be'
        ("He been", True), # verb 'been'
        ("He being", True), # verb 'being'
        ("He am", True), # verb 'am'
        ("He are", True), # verb 'are'
    ],
)
def test_contains_verb_parametrized(text, expected):
    # Parametrized test for common verb forms and edge cases
    codeflash_output = contains_verb(text) # 1.07ms -> 1.05ms (2.21% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
import pytest
from unstructured.partition.text_type import contains_verb
def test_contains_verb():
    # NOTE: SideEffectDetected (raised by CrossHair) is not imported in this generated test.
    with pytest.raises(
        SideEffectDetected,
        match='We\'ve\\ blocked\\ a\\ file\\ writing\\ operation\\ on\\ "/tmp/z0fmgvet"\\.\\ It\'s\\ dangerous\\ to\\ run\\ CrossHair\\ on\\ code\\ with\\ side\\ effects\\.\\ To\\ allow\\ this\\ operation\\ anyway,\\ use\\ "\\-\\-unblock=open:/tmp/z0fmgvet:None:655554"\\.\\ \\(or\\ some\\ colon\\-delimited\\ prefix\\)',
    ):
        contains_verb("🄰")
```
</details>
<details>
<summary>⏪ Click to see Replay Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `test_benchmark5_py__replay_test_0.py::test_unstructured_partition_text_type_contains_verb` | 3.19ms | 3.08ms | 3.40%✅ |
</details>
To edit these changes `git checkout codeflash/optimize-contains_verb-mjit1e7b` and push.

---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>