unstructured
2bc71c53 - enhancement: Speed up function `contains_verb` by 8% (#4161)

Commit
16 days ago
<!-- CODEFLASH_OPTIMIZATION: {"function":"contains_verb","file":"unstructured/partition/text_type.py","speedup_pct":"8%","speedup_x":"0.08x","original_runtime":"890 milliseconds","best_runtime":"827 milliseconds","optimization_type":"loop","timestamp":"2025-12-23T16:34:05.083Z","version":"1.0"} -->

#### 📄 8% (0.08x) speedup for ***`contains_verb` in `unstructured/partition/text_type.py`***

⏱️ Runtime : **`890 milliseconds`** **→** **`827 milliseconds`** (best of `7` runs)

#### 📝 Explanation and details

The optimization achieves a **7% speedup** by replacing NLTK's sequential sentence-by-sentence POS tagging with batch processing using `pos_tag_sents`.

**What Changed:**
- **Batch POS tagging**: Instead of calling `_pos_tag()` individually for each sentence in a loop, the code now tokenizes all sentences first, then passes them together to `_pos_tag_sents()`. This single batched call processes all sentences at once.
- **List comprehension for flattening**: The nested loop that extended `parts_of_speech` is replaced with a list comprehension that flattens the result from `_pos_tag_sents()`.

**Why It's Faster:**
NLTK's `pos_tag()` performs setup overhead (model loading, context initialization) on each invocation. When processing multi-sentence text, calling it N times incurs N × overhead. By contrast, `pos_tag_sents()` performs this setup once and processes all sentences in a single batch, reducing overhead from O(N) to O(1). This is particularly effective for texts with multiple sentences.

**Impact Based on Context:**
The `contains_verb()` function is called from `is_possible_narrative_text()`, which appears to be in a document classification/partitioning pipeline. Given that this function checks for narrative text characteristics, it likely runs on many text segments during document processing.
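The difference between per-sentence and batched tagging described under "What Changed" can be sketched with a toy model. This is a minimal illustration, not the actual `unstructured` or NLTK code: `ToyTagger`, `tag_one`, `tag_many`, `tag_per_sentence`, and `tag_batched` are hypothetical names, and the constructor counter is an assumed stand-in for whatever per-call setup cost the real tagger pays.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]

# Toy vocabulary, purely illustrative; a real tagger uses a trained model.
TOY_VERBS = {"runs", "jumps", "barked"}


class ToyTagger:
    """Hypothetical stand-in for a POS tagger that is costly to construct."""

    loads = 0  # counts constructions, i.e. how often the setup cost is paid

    def __init__(self) -> None:
        ToyTagger.loads += 1

    def tag(self, tokens: List[str]) -> List[Tagged]:
        return [(t, "VBZ" if t in TOY_VERBS else "NN") for t in tokens]


def tag_one(tokens: List[str]) -> List[Tagged]:
    """Analogue of a per-sentence call: builds a tagger on every call."""
    return ToyTagger().tag(tokens)


def tag_many(sentences: List[List[str]]) -> List[List[Tagged]]:
    """Analogue of a batched call: builds the tagger once for all sentences."""
    tagger = ToyTagger()
    return [tagger.tag(tokens) for tokens in sentences]


def tag_per_sentence(sentences: List[List[str]]) -> List[Tagged]:
    """Original shape: loop over sentences and extend a flat list."""
    parts_of_speech: List[Tagged] = []
    for tokens in sentences:
        parts_of_speech.extend(tag_one(tokens))
    return parts_of_speech


def tag_batched(sentences: List[List[str]]) -> List[Tagged]:
    """Optimized shape: one batched call, flattened with a comprehension."""
    return [pair for sentence in tag_many(sentences) for pair in sentence]
```

Both functions return the same flat list of `(token, tag)` pairs; in this toy model the only difference is that `tag_per_sentence` constructs one tagger per sentence while `tag_batched` constructs one in total.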
The optimization provides:
- **~9% speedup** for large-scale tests with many sentences (e.g., 200+ repeated sentences)
- **5-8% speedup** for typical multi-sentence inputs
- **Minimal/negative impact** on very short inputs (empty strings, single words) due to the overhead of creating intermediate lists, but these cases are typically cached via `@lru_cache`

The batch processing particularly benefits workloads where `is_possible_narrative_text()` processes longer text segments with multiple sentences, which is common in document partitioning tasks. Since the function is cached, the optimization's impact is most significant on cache misses with multi-sentence text.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **23 Passed** |
| 🌀 Generated Regression Tests | ✅ **108 Passed** |
| ⏪ Replay Tests | ✅ **8 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
|📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_text_type.py::test_contains_verb` | 435μs | 438μs | -0.586%⚠️ |

</details>

<details>
<summary>🌀 Click to see Generated Regression Tests</summary>

```python
from __future__ import annotations

from typing import Final, List

# imports
from unstructured.partition.text_type import contains_verb

POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]

# ---- UNIT TESTS ----

# Basic Test Cases


def test_simple_sentence_with_verb():
    # Checks a simple sentence with an obvious verb
    codeflash_output = contains_verb("The cat runs.")  # 203μs -> 193μs (5.46% faster)


def test_simple_sentence_without_verb():
    # Checks a sentence with no verb
    codeflash_output = contains_verb("The blue sky.")  # 130μs -> 124μs (5.04% faster)


def test_question_with_verb():
    # Checks a question containing a verb
    codeflash_output = contains_verb("Is this your book?")  # 95.0μs -> 92.5μs (2.73% faster)


def test_sentence_with_multiple_verbs():
    # Checks a sentence containing more than one verb
    codeflash_output = contains_verb("He jumped and ran.")  # 140μs -> 132μs (6.12% faster)


def test_sentence_with_verb_in_past_tense():
    # Checks a sentence with a past tense verb
    codeflash_output = contains_verb("She walked home.")  # 132μs -> 121μs (8.76% faster)


def test_sentence_with_verb_in_present_participle():
    # Checks a sentence with a present participle verb
    codeflash_output = contains_verb("The dog is barking.")  # 130μs -> 124μs (4.97% faster)


def test_sentence_with_verb_in_past_participle():
    # Checks a sentence with a past participle verb
    codeflash_output = contains_verb("The cake was eaten.")  # 125μs -> 121μs (4.06% faster)


def test_sentence_with_modal_verb():
    # Checks a sentence with a modal verb ("can" is not in POS_VERB_TAGS, but "run" is)
    codeflash_output = contains_verb("He can run.")  # 84.0μs -> 81.7μs (2.83% faster)


def test_sentence_with_no_alphabetic_characters():
    # Checks a string with only punctuation
    codeflash_output = contains_verb("!!!")  # 97.1μs -> 95.7μs (1.44% faster)


def test_sentence_with_numbers_only():
    # Checks a string with only numbers
    codeflash_output = contains_verb("1234567890")  # 87.6μs -> 82.4μs (6.32% faster)


# Edge Test Cases


def test_empty_string():
    # Checks empty input string
    codeflash_output = contains_verb("")  # 6.38μs -> 6.66μs (4.21% slower)


def test_whitespace_only():
    # Checks string with only whitespace
    codeflash_output = contains_verb(" ")  # 6.30μs -> 6.78μs (7.15% slower)


def test_uppercase_sentence_with_verb():
    # Checks that all-uppercase input is lowercased and verbs are detected
    codeflash_output = contains_verb("THE DOG BARKED.")  # 131μs -> 122μs (7.51% faster)


def test_uppercase_sentence_without_verb():
    # Checks that all-uppercase input with no verb returns False
    codeflash_output = contains_verb("THE BLUE SKY.")  # 123μs -> 116μs (5.93% faster)


def test_sentence_with_non_ascii_characters_and_verb():
    # Checks sentence with accented characters and a verb
    codeflash_output = contains_verb("Él corre rápido.")  # 144μs -> 145μs (0.863% slower)


def test_sentence_with_verb_as_ambiguous_word():
    # "Run" as a noun
    codeflash_output = contains_verb("He went for a run.")  # 88.4μs -> 87.2μs (1.38% faster)


def test_sentence_with_verb_as_ambiguous_word_verb_usage():
    # "Run" as a verb
    codeflash_output = contains_verb("He will run tomorrow.")  # 88.9μs -> 86.9μs (2.35% faster)


def test_sentence_with_abbreviation():
    # Checks sentence with abbreviation and verb
    codeflash_output = contains_verb("Dr. Smith arrived.")  # 136μs -> 132μs (3.40% faster)


def test_sentence_with_newlines_and_tab_characters():
    # Checks sentence with newlines and tabs
    codeflash_output = contains_verb(
        "The dog\nbarked.\tThe cat slept."
    )  # 236μs -> 220μs (7.22% faster)


def test_sentence_with_only_stopwords():
    # Checks sentence with only stopwords (no verbs)
    codeflash_output = contains_verb("and the but or")  # 34.5μs -> 33.4μs (3.27% faster)


def test_sentence_with_conjunctions_and_verb():
    # Checks sentence with conjunctions and a verb
    codeflash_output = contains_verb("And then he laughed.")  # 92.7μs -> 97.1μs (4.55% slower)


def test_sentence_with_special_characters_and_verb():
    # Checks sentence with special characters and a verb
    codeflash_output = contains_verb("@user replied!")  # 163μs -> 153μs (6.70% faster)


def test_sentence_with_url_and_verb():
    # Checks sentence with a URL and a verb
    codeflash_output = contains_verb(
        "Check https://example.com and see."
    )  # 217μs -> 206μs (5.12% faster)


def test_sentence_with_emoji_and_verb():
    # Checks sentence with emoji and a verb
    codeflash_output = contains_verb("She runs fast 🏃‍♀️.")  # 178μs -> 167μs (6.75% faster)


def test_sentence_with_unicode_and_no_verb():
    # Checks sentence with unicode and no verb
    codeflash_output = contains_verb("🍎🍏🍐")  # 72.7μs -> 70.9μs (2.50% faster)


def test_sentence_with_single_verb_only():
    # Checks a sentence that is just a verb
    codeflash_output = contains_verb("Run")  # 76.4μs -> 73.1μs (4.46% faster)


def test_sentence_with_single_noun_only():
    # Checks a sentence that is just a noun
    codeflash_output = contains_verb("Tree")  # 78.7μs -> 73.9μs (6.45% faster)


def test_sentence_with_verb_in_quotes():
    # Checks a verb inside quotes
    codeflash_output = contains_verb('"Run" is a verb.')  # 149μs -> 138μs (7.65% faster)


def test_sentence_with_parentheses_and_verb():
    # Checks a verb inside parentheses
    codeflash_output = contains_verb("He (runs) every day.")  # 92.4μs -> 89.8μs (2.91% faster)


def test_sentence_with_dash_and_verb():
    # Checks a sentence with a dash and a verb
    codeflash_output = contains_verb("He - runs.")  # 80.6μs -> 81.4μs (1.02% slower)


def test_sentence_with_multiple_sentences_and_one_verb():
    # Checks multiple sentences, only one has a verb
    codeflash_output = contains_verb("The blue sky. The cat runs.")  # 252μs -> 248μs (1.88% faster)


def test_sentence_with_multiple_sentences_no_verbs():
    # Checks multiple sentences, none have verbs
    codeflash_output = contains_verb("The blue sky. The red car.")  # 199μs -> 195μs (1.93% faster)


def test_sentence_with_number_and_verb():
    # Checks sentence with number and verb
    codeflash_output = contains_verb("There are 5 cats.")  # 88.4μs -> 86.2μs (2.54% faster)


def test_sentence_with_number_and_no_verb():
    # Checks sentence with number and no verb
    codeflash_output = contains_verb("5 cats.")  # 76.5μs -> 74.9μs (2.11% faster)


def test_sentence_with_plural_noun_no_verb():
    # Checks plural noun with no verb
    codeflash_output = contains_verb("Cats.")  # 77.7μs -> 74.4μs (4.52% faster)


def test_sentence_with_verb_and_compound_noun():
    # Checks sentence with compound noun and verb
    codeflash_output = contains_verb("The ice-cream melts.")  # 130μs -> 130μs (0.354% faster)


# Large Scale Test Cases


def test_large_text_with_many_verbs():
    # Checks a long text with many verbs
    text = " ".join(["The dog runs. The cat jumps. The bird flies." for _ in range(200)])
    codeflash_output = contains_verb(text)  # 51.3ms -> 47.0ms (9.18% faster)


def test_large_text_with_no_verbs():
    # Checks a long text with no verbs
    text = " ".join(["The blue sky. The red car. The green grass." for _ in range(200)])
    codeflash_output = contains_verb(text)  # 46.4ms -> 42.5ms (9.19% faster)


def test_large_text_with_verbs_in_middle():
    # Checks a long text with verbs only in the middle
    text = (
        " ".join(["The blue sky." for _ in range(100)])
        + " The cat ran. "
        + " ".join(["The green grass." for _ in range(100)])
    )
    codeflash_output = contains_verb(text)  # 17.0ms -> 16.1ms (5.72% faster)


def test_large_text_with_uppercase_and_verbs():
    # Checks a long uppercase text with verbs
    text = " ".join(["THE DOG RAN. THE CAT JUMPED. THE BIRD FLEW." for _ in range(200)])
    codeflash_output = contains_verb(text)  # 51.6ms -> 47.1ms (9.56% faster)


def test_large_text_with_mixed_case_and_verbs():
    # Checks a long text with mixed case and verbs
    text = "The dog ran. " * 500 + "the cat slept. " * 500
    codeflash_output = contains_verb(text)  # 83.5ms -> 77.5ms (7.64% faster)


def test_large_text_with_numbers_and_no_verbs():
    # Checks a long text with only numbers and no verbs
    text = "1234567890 " * 1000
    codeflash_output = contains_verb(text)  # 32.3ms -> 31.0ms (4.08% faster)


def test_large_text_with_emojis_and_no_verbs():
    # Checks a long text with only emojis and no verbs
    text = "😀😃😄😁😆😅😂🤣☺️😊 " * 100
    codeflash_output = contains_verb(text)  # 2.24ms -> 2.20ms (1.97% faster)


def test_large_text_with_verbs_and_special_characters():
    # Checks a long text with verbs and special characters
    text = "He runs! @user replied. #hashtag " * 300
    codeflash_output = contains_verb(text)  # 57.6ms -> 52.8ms (9.10% faster)


def test_large_text_all_uppercase_no_verbs():
    # Checks a long uppercase text with no verbs
    text = ("THE BLUE SKY. THE RED CAR. " * 400).strip()
    codeflash_output = contains_verb(text)  # 55.7ms -> 52.2ms (6.80% faster)


def test_large_text_with_sentences_and_newlines():
    # Checks a long text with newlines and verbs
    text = "\n".join(["The dog barked." for _ in range(300)])
    codeflash_output = contains_verb(text)  # 26.0ms -> 24.0ms (8.08% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
import pytest  # used for our unit tests
from unstructured.partition.text_type import contains_verb  # function to test

# (Assume the code for pos_tag and contains_verb is as given in the prompt.)

# --- Basic Test Cases ---


def test_contains_verb_simple_sentence():
    # Basic sentence with a single verb
    codeflash_output = contains_verb("The cat sleeps.")  # 153μs -> 169μs (8.96% slower)


def test_contains_verb_multiple_verbs():
    # Sentence with multiple verbs
    codeflash_output = contains_verb(
        "She runs and jumps every morning."
    )  # 144μs -> 140μs (2.87% faster)


def test_contains_verb_no_verb():
    # Sentence with no verbs
    codeflash_output = contains_verb("The blue sky.")  # 128μs -> 123μs (4.15% faster)


def test_contains_verb_question():
    # Question form with a verb
    codeflash_output = contains_verb("Is this your book?")  # 98.0μs -> 94.5μs (3.77% faster)


def test_contains_verb_negative_sentence():
    # Sentence with negation
    codeflash_output = contains_verb("He does not like apples.")  # 142μs -> 142μs (0.153% slower)


def test_contains_verb_verb_ing():
    # Sentence with present participle verb
    codeflash_output = contains_verb("Running is fun.")  # 136μs -> 127μs (7.00% faster)


def test_contains_verb_past_tense():
    # Sentence with past tense verb
    codeflash_output = contains_verb("He walked home.")  # 133μs -> 125μs (6.28% faster)


def test_contains_verb_passive_voice():
    # Passive voice sentence
    codeflash_output = contains_verb("The cake was eaten.")  # 129μs -> 124μs (3.86% faster)


def test_contains_verb_uppercase_text():
    # Text in uppercase, should be normalized
    codeflash_output = contains_verb("THE DOG BARKED.")  # 120μs -> 111μs (8.03% faster)


def test_contains_verb_mixed_case_text():
    # Mixed case, should work
    codeflash_output = contains_verb("tHe CaT SlePt.")  # 151μs -> 147μs (3.01% faster)


# --- Edge Test Cases ---


def test_contains_verb_empty_string():
    # Empty string input
    codeflash_output = contains_verb("")  # 6.85μs -> 7.21μs (4.95% slower)


def test_contains_verb_whitespace_only():
    # String with only whitespace
    codeflash_output = contains_verb(" ")  # 6.69μs -> 6.93μs (3.43% slower)


def test_contains_verb_non_english():
    # Non-English text (should return False as no English verbs)
    codeflash_output = contains_verb("これは日本語の文です。")  # 91.3μs -> 88.4μs (3.33% faster)


def test_contains_verb_numbers_and_symbols():
    # String with only numbers and symbols
    codeflash_output = contains_verb("12345 !@#$%")  # 177μs -> 180μs (1.75% slower)


def test_contains_verb_one_word_noun():
    # Single noun word
    codeflash_output = contains_verb("Table")  # 78.6μs -> 72.2μs (8.81% faster)


def test_contains_verb_one_word_verb():
    # Single verb word
    codeflash_output = contains_verb("Run")  # 74.7μs -> 73.2μs (2.02% faster)


def test_contains_verb_command():
    # Imperative/command sentence
    codeflash_output = contains_verb("Sit!")  # 73.2μs -> 76.4μs (4.14% slower)


def test_contains_verb_sentence_with_url():
    # Sentence containing a URL
    codeflash_output = contains_verb(
        "Visit https://example.com for more info."
    )  # 254μs -> 244μs (4.09% faster)


def test_contains_verb_sentence_with_abbreviation():
    # Sentence containing abbreviations
    codeflash_output = contains_verb("Dr. Smith arrived.")  # 129μs -> 129μs (0.051% slower)


def test_contains_verb_sentence_with_apostrophe():
    # Sentence with contractions
    codeflash_output = contains_verb("He can't go.")  # 93.0μs -> 91.8μs (1.22% faster)


def test_contains_verb_sentence_with_quotes():
    # Sentence with quoted verb
    codeflash_output = contains_verb('He said, "Run!"')  # 134μs -> 132μs (2.13% faster)


def test_contains_verb_sentence_with_parentheses():
    # Sentence with verb inside parentheses
    codeflash_output = contains_verb("The dog (barked) loudly.")  # 159μs -> 166μs (4.20% slower)


def test_contains_verb_sentence_with_no_alpha():
    # String with no alphabetic characters
    codeflash_output = contains_verb("1234567890")  # 75.7μs -> 75.5μs (0.327% faster)


def test_contains_verb_sentence_with_newlines():
    # Sentence with newlines
    codeflash_output = contains_verb("The dog\nbarked.")  # 120μs -> 109μs (9.95% faster)


def test_contains_verb_sentence_with_tabs():
    # Sentence with tabs
    codeflash_output = contains_verb("The\tdog\tbarked.")  # 114μs -> 104μs (9.09% faster)


def test_contains_verb_sentence_with_multiple_sentences():
    # Multiple sentences, at least one with a verb
    codeflash_output = contains_verb(
        "The sky. The dog barked. The tree."
    )  # 276μs -> 260μs (5.88% faster)


def test_contains_verb_sentence_with_multiple_sentences_no_verbs():
    # Multiple sentences, none with verbs
    codeflash_output = contains_verb(
        "The sky. The tree. The mountain."
    )  # 229μs -> 220μs (4.43% faster)


def test_contains_verb_sentence_with_hyphenated_words():
    # Sentence with hyphenated words and a verb
    codeflash_output = contains_verb(
        "The well-known actor performed."
    )  # 163μs -> 165μs (0.896% slower)


def test_contains_verb_sentence_with_non_ascii_chars():
    # Sentence with accented characters and a verb
    codeflash_output = contains_verb("José runs every day.")  # 124μs -> 123μs (1.38% faster)


def test_contains_verb_sentence_with_emojis():
    # Sentence with emojis and a verb
    codeflash_output = contains_verb("He runs 🏃‍♂️ every day.")  # 126μs -> 127μs (1.02% slower)


def test_contains_verb_sentence_with_verb_as_noun():
    # Word that can be both noun and verb, used as noun
    codeflash_output = contains_verb("The run was long.")  # 127μs -> 135μs (6.02% slower)


def test_contains_verb_sentence_with_verb_as_noun_and_verb():
    # Word that can be both noun and verb, used as verb
    codeflash_output = contains_verb("They run every day.")  # 83.9μs -> 76.5μs (9.70% faster)


# --- Large Scale Test Cases ---


def test_contains_verb_large_text_with_verbs():
    # Large text (about 1000 words) with verbs scattered throughout
    text = " ".join(["He runs."] * 500 + ["The cat sleeps."] * 500)
    codeflash_output = contains_verb(text)  # 68.4ms -> 62.7ms (9.04% faster)


def test_contains_verb_large_text_no_verbs():
    # Large text (about 1000 words) with no verbs
    text = " ".join(["The mountain."] * 1000)
    codeflash_output = contains_verb(text)  # 57.4ms -> 53.2ms (7.83% faster)


def test_contains_verb_large_text_mixed():
    # Large text with verbs only in the last sentence
    text = " ".join(["The mountain."] * 999 + ["He runs."])
    codeflash_output = contains_verb(text)  # 57.8ms -> 53.1ms (8.73% faster)


def test_contains_verb_large_text_all_uppercase():
    # Large uppercase text with verbs, should normalize
    text = " ".join(["THE DOG BARKED."] * 1000)
    codeflash_output = contains_verb(text)  # 85.5ms -> 78.6ms (8.74% faster)


def test_contains_verb_large_text_with_newlines():
    # Large text with newlines separating sentences
    text = "\n".join(["He runs."] * 1000)
    codeflash_output = contains_verb(text)  # 53.3ms -> 49.7ms (7.36% faster)


def test_contains_verb_large_text_with_numbers_and_symbols():
    # Large text with numbers, symbols, and a single verb sentence
    text = "12345 !@#$% " * 999 + "He runs."
    codeflash_output = contains_verb(text)  # 78.4ms -> 73.0ms (7.37% faster)


def test_contains_verb_large_text_all_nouns():
    # Large text with only nouns
    text = " ".join(["Table"] * 1000)
    codeflash_output = contains_verb(text)  # 27.4ms -> 27.0ms (1.51% faster)


def test_contains_verb_large_text_all_verbs():
    # Large text with only verbs
    text = " ".join(["Run"] * 1000)
    codeflash_output = contains_verb(text)  # 25.5ms -> 24.8ms (2.85% faster)


# --- Mutation Testing Cases (to catch subtle bugs) ---


@pytest.mark.parametrize(
    "text,expected",
    [
        ("run", True),  # verb, lower case
        ("RUN", True),  # verb, upper case
        ("Running", True),  # verb, gerund
        ("RAN", True),  # verb, past tense
        ("", False),  # empty
        (" ", False),  # whitespace
        ("Table", False),  # noun
        ("Table run", True),  # noun and verb
        ("The", False),  # article
        ("quickly", False),  # adverb
        ("quickly run", True),  # adverb + verb
        ("run quickly", True),  # verb + adverb
        ("He", False),  # pronoun
        ("He runs", True),  # pronoun + verb
        ("He run", True),  # pronoun + verb (incorrect grammar but verb present)
        ("He is", True),  # verb 'is'
        ("He was", True),  # verb 'was'
        ("He be", True),  # verb 'be'
        ("He been", True),  # verb 'been'
        ("He being", True),  # verb 'being'
        ("He am", True),  # verb 'am'
        ("He are", True),  # verb 'are'
    ],
)
def test_contains_verb_parametrized(text, expected):
    # Parametrized test for common verb forms and edge cases
    codeflash_output = contains_verb(text)  # 1.07ms -> 1.05ms (2.21% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
import pytest

from unstructured.partition.text_type import contains_verb


def test_contains_verb():
    with pytest.raises(
        SideEffectDetected,
        match='We\'ve\\ blocked\\ a\\ file\\ writing\\ operation\\ on\\ "/tmp/z0fmgvet"\\.\\ It\'s\\ dangerous\\ to\\ run\\ CrossHair\\ on\\ code\\ with\\ side\\ effects\\.\\ To\\ allow\\ this\\ operation\\ anyway,\\ use\\ "\\-\\-unblock=open:/tmp/z0fmgvet:None:655554"\\.\\ \\(or\\ some\\ colon\\-delimited\\ prefix\\)',
    ):
        contains_verb("🄰")
```

</details>

<details>
<summary>⏪ Click to see Replay Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `test_benchmark5_py__replay_test_0.py::test_unstructured_partition_text_type_contains_verb` | 3.19ms | 3.08ms | 3.40%✅ |

</details>

To edit these changes `git checkout codeflash/optimize-contains_verb-mjit1e7b` and push.

[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=)](https://codeflash.ai) ![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>