unstructured
e670864b - enhancement: Speed up function `ngrams` by 188% (#4177)

Commit

67 days ago

enhancement: Speed up function `ngrams` by 188% (#4177)

#### 📄 188% (1.88x) speedup for ***`ngrams` in `unstructured/utils.py`***

⏱️ Runtime : **`6.12 milliseconds`** **→** **`2.13 milliseconds`** (best of `138` runs)

#### 📝 Explanation and details

The optimized code achieves a **188% speedup** by replacing nested loops with list slicing inside a list comprehension. Here's why it's faster:

## Key Optimizations

**1. List Comprehension vs Nested Loops**
- **Original**: Uses nested loops with individual element appends (`ngram.append(s[i + j])`). This creates and grows a temporary list `ngram` for each n-gram, then converts it to a tuple
- **Optimized**: Uses list slicing `s[i:i+n]`, which is implemented in C and creates the subsequence in one operation

**2. Eliminated Redundant Operations**

The line profiler shows the original code spends:
- 35% of time in the inner loop iteration (`for j in range(n)`)
- 37% of time appending elements (`ngram.append(s[i + j])`)
- 12.5% converting lists to tuples (`tuple(ngram)`)

The optimized version removes the inner loop and the per-element appends by extracting the slice and converting it to a tuple in a single expression.

## Performance Impact by Context

The function is called in `calculate_shared_ngram_percentage()`, which operates on split text strings. This is likely used for text similarity analysis.

The optimization particularly benefits:
- **Large n-grams**: When `n` is large (e.g., `n=1000`), the speedup reaches **1394%** because the original code's inner loop overhead scales with `n` at the Python level, while slicing copies the `n` elements in C without per-element interpreter overhead
- **Many n-grams**: For lists with 1000 elements and `n=2-3`, speedup is **181-234%** because the outer loop runs many times
- **Hot paths**: Since this is used in text similarity calculations, it's likely called frequently on document chunks, making even the 5-20% gains on small inputs meaningful

## Edge Case Handling

The optimized code adds explicit handling for `n <= 0`:
- Returns an empty tuple for each position when `n <= 0`, matching the original behavior, where `range(n)` produces no iterations for zero or negative `n`
- This is 7-9% faster for these cases while maintaining correctness

## Test Results Summary

- **Small inputs** (3-10 elements): typically 5-40% faster, with a few cases within about 3% of the original in either direction
- **Medium inputs** (100-500 elements): 228-1108% faster depending on `n`
- **Large inputs** (1000 elements): 132-1394% faster depending on `n`
- **Edge cases** (empty lists, `n > len`): Some are roughly 25-30% slower due to the overhead of setting up a list comprehension that yields nothing, but these are rare cases with negligible absolute time impact (<3μs)

The optimization trades slightly slower edge case performance for dramatically better typical case performance, which is the right tradeoff given the function's usage pattern in text processing.
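The change described above can be illustrated with a minimal sketch. It is reconstructed from this description only and is not the code from `unstructured/utils.py`: the function names `ngrams_loop` and `ngrams_slice` are hypothetical, and the exact form of the `n <= 0` branch is an assumption. The explicit branch is needed in a slicing version because a slice such as `s[i:i-1]` is not empty for negative `n`, whereas the inner loop over `range(n)` simply does nothing.

```python
from typing import List, Sequence, Tuple


def ngrams_loop(s: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Nested-loop form (hypothetical reconstruction of the original)."""
    result = []
    for i in range(len(s) - n + 1):
        ngram = []
        # Inner loop: one Python-level append per element of the n-gram.
        for j in range(n):
            ngram.append(s[i + j])
        result.append(tuple(ngram))
    return result


def ngrams_slice(s: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Slice-based form (hypothetical reconstruction of the optimization)."""
    count = len(s) - n + 1
    if n <= 0:
        # Assumed edge-case handling: mirror the loop version, which emits
        # one empty tuple per position because range(n) never iterates.
        return [() for _ in range(count)]
    # The slice copies n elements in C; tuple() converts it in one call.
    return [tuple(s[i : i + n]) for i in range(count)]


if __name__ == "__main__":
    words = ["the", "quick", "brown", "fox"]
    print(ngrams_slice(words, 2))
    # Both forms should agree, including for n <= 0 and n > len(words).
    for n in range(-2, 7):
        assert ngrams_loop(words, n) == ngrams_slice(words, n)
```

To compare the two forms on your own inputs, the standard-library `timeit` module can time each function; results will vary with the interpreter and machine.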
✅ **Correctness verification report:**

| Test                          | Status            |
| ----------------------------- | ----------------- |
| ⚙️ Existing Unit Tests        | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **58 Passed**  |
| ⏪ Replay Tests               | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests    | ✅ **1 Passed**   |
| 📊 Tests Coverage             | 100.0%            |

<details>
<summary>🌀 Click to see Generated Regression Tests</summary>

```python
from __future__ import annotations

# imports
from unstructured.utils import ngrams

# unit tests

# -------------------- BASIC TEST CASES --------------------


def test_ngrams_basic_unigram():
    # Test with n=1 (unigram)
    s = ["a", "b", "c"]
    codeflash_output = ngrams(s, 1)
    result = codeflash_output  # 4.39μs -> 4.17μs (5.30% faster)


def test_ngrams_basic_bigram():
    # Test with n=2 (bigram)
    s = ["a", "b", "c"]
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 4.29μs -> 4.00μs (7.46% faster)


def test_ngrams_basic_trigram():
    # Test with n=3 (trigram)
    s = ["a", "b", "c"]
    codeflash_output = ngrams(s, 3)
    result = codeflash_output  # 3.75μs -> 3.77μs (0.531% slower)


def test_ngrams_basic_typical_sentence():
    # Test with a typical sentence split into words
    s = ["the", "quick", "brown", "fox", "jumps"]
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 5.16μs -> 4.33μs (19.1% faster)


def test_ngrams_basic_full_ngram():
    # Test where n equals the length of the list
    s = ["a", "b", "c", "d"]
    codeflash_output = ngrams(s, 4)
    result = codeflash_output  # 3.88μs -> 3.81μs (1.63% faster)


# -------------------- EDGE TEST CASES --------------------


def test_ngrams_empty_list():
    # Test with an empty list
    s = []
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 1.98μs -> 2.82μs (29.9% slower)


def test_ngrams_n_zero():
    # Test with n=0, should return empty list (no 0-grams)
    s = ["a", "b", "c"]
    codeflash_output = ngrams(s, 0)
    result = codeflash_output  # 3.93μs -> 3.67μs (7.20% faster)


def test_ngrams_n_negative():
    # Test with negative n, should return empty list (no negative n-grams)
    s = ["a", "b", "c"]
    codeflash_output = ngrams(s, -1)
    result = codeflash_output  # 4.16μs -> 3.84μs (8.27% faster)


def test_ngrams_n_greater_than_len():
    # Test with n greater than the length of the list
    s = ["a", "b"]
    codeflash_output = ngrams(s, 3)
    result = codeflash_output  # 2.03μs -> 2.80μs (27.4% slower)


def test_ngrams_n_equals_zero_and_empty_list():
    # Test with n=0 and empty list
    s = []
    codeflash_output = ngrams(s, 0)
    result = codeflash_output  # 3.18μs -> 3.49μs (8.90% slower)


def test_ngrams_list_of_length_one():
    # Test with a single element list and n=1
    s = ["a"]
    codeflash_output = ngrams(s, 1)
    result = codeflash_output  # 3.47μs -> 3.79μs (8.57% slower)


def test_ngrams_list_of_length_one_n_greater():
    # Test with a single element list and n>1
    s = ["a"]
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 2.08μs -> 2.76μs (24.5% slower)


def test_ngrams_non_ascii_characters():
    # Test with non-ASCII and unicode characters
    s = ["你好", "世界", "😊"]
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 4.23μs -> 4.02μs (5.07% faster)


def test_ngrams_repeated_elements():
    # Test with repeated elements in the list
    s = ["a", "a", "a", "a"]
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 4.78μs -> 4.26μs (12.3% faster)


def test_ngrams_with_empty_strings():
    # Test with empty strings as elements
    s = ["", "a", ""]
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 4.24μs -> 4.04μs (4.85% faster)


def test_ngrams_with_mixed_types_raises():
    # Test with non-string elements should raise TypeError in type-checked code, but function as written does not check
    s = ["a", 1, None]
    # The function will not error, but let's check that output matches tuple of elements
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 4.30μs -> 4.07μs (5.66% faster)


def test_ngrams_large_n_and_empty_list():
    # Test with very large n and empty list
    s = []
    codeflash_output = ngrams(s, 100)
    result = codeflash_output  # 2.22μs -> 2.94μs (24.5% slower)


# -------------------- LARGE SCALE TEST CASES --------------------


def test_ngrams_large_input_unigram():
    # Test with a large list and n=1 (should return all elements as singletons)
    s = [str(i) for i in range(1000)]
    codeflash_output = ngrams(s, 1)
    result = codeflash_output  # 372μs -> 157μs (136% faster)


def test_ngrams_large_input_bigram():
    # Test with a large list and n=2 (should return len(s)-1 bigrams)
    s = [str(i) for i in range(1000)]
    codeflash_output = ngrams(s, 2)
    result = codeflash_output  # 457μs -> 162μs (181% faster)


def test_ngrams_large_input_trigram():
    # Test with a large list and n=3
    s = [str(i) for i in range(1000)]
    codeflash_output = ngrams(s, 3)
    result = codeflash_output  # 541μs -> 162μs (234% faster)


def test_ngrams_large_input_n_equals_length():
    # Test with a large list and n equals the list length
    s = [str(i) for i in range(1000)]
    codeflash_output = ngrams(s, 1000)
    result = codeflash_output  # 99.9μs -> 8.80μs (1035% faster)


def test_ngrams_large_input_n_greater_than_length():
    # Test with a large list and n greater than the list length
    s = [str(i) for i in range(1000)]
    codeflash_output = ngrams(s, 1001)
    result = codeflash_output  # 1.71μs -> 2.42μs (29.6% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
# imports
import pytest  # used for our unit tests
from unstructured.utils import ngrams

# unit tests


class TestNgramsBasic:
    """Basic test cases for normal operating conditions"""

    def test_bigrams_simple_sentence(self):
        # Test generating bigrams (n=2) from a simple sentence
        words = ["the", "quick", "brown", "fox"]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 4.63μs -> 4.23μs (9.43% faster)
        expected = [("the", "quick"), ("quick", "brown"), ("brown", "fox")]

    def test_trigrams_simple_sentence(self):
        # Test generating trigrams (n=3) from a simple sentence
        words = ["I", "love", "to", "code"]
        codeflash_output = ngrams(words, 3)
        result = codeflash_output  # 4.37μs -> 4.03μs (8.36% faster)
        expected = [("I", "love", "to"), ("love", "to", "code")]

    def test_unigrams(self):
        # Test generating unigrams (n=1), should return each word as a single-element tuple
        words = ["hello", "world"]
        codeflash_output = ngrams(words, 1)
        result = codeflash_output  # 4.04μs -> 4.00μs (1.000% faster)
        expected = [("hello",), ("world",)]

    def test_fourgrams(self):
        # Test generating 4-grams from a longer sequence
        words = ["a", "b", "c", "d", "e", "f"]
        codeflash_output = ngrams(words, 4)
        result = codeflash_output  # 5.29μs -> 4.26μs (24.0% faster)
        expected = [("a", "b", "c", "d"), ("b", "c", "d", "e"), ("c", "d", "e", "f")]

    def test_single_word_list_unigram(self):
        # Test with a single word and n=1
        words = ["hello"]
        codeflash_output = ngrams(words, 1)
        result = codeflash_output  # 3.31μs -> 3.75μs (11.7% slower)
        expected = [("hello",)]

    def test_exact_length_match(self):
        # Test when n equals the length of the list (should return one n-gram)
        words = ["one", "two", "three"]
        codeflash_output = ngrams(words, 3)
        result = codeflash_output  # 3.64μs -> 3.74μs (2.65% slower)
        expected = [("one", "two", "three")]

    def test_numeric_strings(self):
        # Test with numeric strings to ensure type handling
        words = ["1", "2", "3", "4", "5"]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 5.03μs -> 4.26μs (17.9% faster)
        expected = [("1", "2"), ("2", "3"), ("3", "4"), ("4", "5")]

    def test_special_characters(self):
        # Test with special characters and punctuation
        words = ["Hello", ",", "world", "!", "How", "are", "you", "?"]
        codeflash_output = ngrams(words, 3)
        result = codeflash_output  # 6.60μs -> 4.74μs (39.3% faster)
        expected = [
            ("Hello", ",", "world"),
            (",", "world", "!"),
            ("world", "!", "How"),
            ("!", "How", "are"),
            ("How", "are", "you"),
            ("are", "you", "?"),
        ]


class TestNgramsEdgeCases:
    """Edge cases and unusual conditions"""

    def test_empty_list(self):
        # Test with an empty list, should return empty list
        words = []
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 1.91μs -> 2.74μs (30.3% slower)
        expected = []

    def test_n_greater_than_list_length(self):
        # Test when n is greater than the list length, should return empty list
        words = ["one", "two"]
        codeflash_output = ngrams(words, 5)
        result = codeflash_output  # 1.94μs -> 2.76μs (29.6% slower)
        expected = []

    def test_n_equals_zero(self):
        # Test with n=0, should return empty list (no 0-grams possible)
        words = ["a", "b", "c"]
        codeflash_output = ngrams(words, 0)
        result = codeflash_output  # 3.82μs -> 3.51μs (8.68% faster)
        expected = []

    def test_n_negative(self):
        # Test with negative n, should return empty list
        words = ["a", "b", "c"]
        codeflash_output = ngrams(words, -1)
        result = codeflash_output  # 3.99μs -> 3.65μs (9.31% faster)
        expected = []

    def test_very_large_n(self):
        # Test with very large n value, much greater than list length
        words = ["a", "b"]
        codeflash_output = ngrams(words, 1000)
        result = codeflash_output  # 2.09μs -> 2.80μs (25.5% slower)
        expected = []

    def test_empty_strings_in_list(self):
        # Test with empty strings as elements
        words = ["", "hello", "", "world", ""]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 5.21μs -> 4.36μs (19.6% faster)
        expected = [("", "hello"), ("hello", ""), ("", "world"), ("world", "")]

    def test_whitespace_strings(self):
        # Test with whitespace-only strings
        words = [" ", " ", " ", " "]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 4.68μs -> 4.19μs (11.6% faster)
        expected = [(" ", " "), (" ", " "), (" ", " ")]

    def test_duplicate_consecutive_words(self):
        # Test with duplicate consecutive words
        words = ["the", "the", "the", "end"]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 4.73μs -> 4.21μs (12.3% faster)
        expected = [("the", "the"), ("the", "the"), ("the", "end")]

    def test_unicode_characters(self):
        # Test with unicode characters
        words = ["hello", "世界", "🌍", "مرحبا"]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 4.69μs -> 4.19μs (11.9% faster)
        expected = [("hello", "世界"), ("世界", "🌍"), ("🌍", "مرحبا")]

    def test_very_long_strings(self):
        # Test with very long individual strings
        long_string = "a" * 10000
        words = [long_string, "short", long_string]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 4.16μs -> 4.08μs (2.04% faster)
        expected = [(long_string, "short"), ("short", long_string)]

    def test_single_element_list_bigram(self):
        # Test with single element list and n=2, should return empty
        words = ["alone"]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 1.94μs -> 2.79μs (30.5% slower)
        expected = []

    def test_two_elements_trigram(self):
        # Test with two elements and n=3, should return empty
        words = ["one", "two"]
        codeflash_output = ngrams(words, 3)
        result = codeflash_output  # 1.91μs -> 2.79μs (31.5% slower)
        expected = []

    def test_result_is_list_of_tuples(self):
        # Verify the result is a list and contains tuples
        words = ["a", "b", "c"]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 4.16μs -> 4.09μs (1.76% faster)

    def test_tuples_are_immutable(self):
        # Verify that returned tuples are truly tuples (immutable)
        words = ["x", "y", "z"]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 4.19μs -> 3.94μs (6.43% faster)
        # Try to modify a tuple (should raise TypeError)
        with pytest.raises(TypeError):
            result[0][0] = "modified"

    def test_original_list_unchanged(self):
        # Verify the original list is not modified
        words = ["a", "b", "c", "d"]
        original_copy = words.copy()
        ngrams(words, 2)  # 4.68μs -> 4.12μs (13.7% faster)

    def test_mixed_case_sensitivity(self):
        # Test that function preserves case
        words = ["Hello", "WORLD", "hello", "world"]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 4.60μs -> 4.16μs (10.6% faster)
        expected = [("Hello", "WORLD"), ("WORLD", "hello"), ("hello", "world")]


class TestNgramsLargeScale:
    """Large scale tests for performance and scalability"""

    def test_large_list_bigrams(self):
        # Test with a large list (1000 elements) generating bigrams
        words = [f"word{i}" for i in range(1000)]
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 456μs -> 162μs (181% faster)

    def test_large_list_small_n(self):
        # Test with large list and small n value
        words = [f"token{i}" for i in range(500)]
        codeflash_output = ngrams(words, 3)
        result = codeflash_output  # 267μs -> 81.6μs (228% faster)

    def test_large_list_large_n(self):
        # Test with large list and large n value
        words = [f"item{i}" for i in range(100)]
        codeflash_output = ngrams(words, 50)
        result = codeflash_output  # 235μs -> 19.5μs (1108% faster)

    def test_large_n_value_unigrams(self):
        # Test with large list generating unigrams (should be fast)
        words = [f"element{i}" for i in range(1000)]
        codeflash_output = ngrams(words, 1)
        result = codeflash_output  # 372μs -> 160μs (132% faster)

    def test_maximum_size_ngram(self):
        # Test generating an n-gram that spans almost the entire list
        words = [f"w{i}" for i in range(100)]
        codeflash_output = ngrams(words, 99)
        result = codeflash_output  # 21.2μs -> 4.66μs (354% faster)

    def test_many_small_ngrams(self):
        # Test generating many small n-grams from a large list
        words = [chr(65 + (i % 26)) for i in range(1000)]  # A-Z repeated
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 454μs -> 160μs (183% faster)
        # Verify structure is maintained
        for i, ngram in enumerate(result):
            pass

    def test_repeated_pattern_large_scale(self):
        # Test with repeated pattern in large list
        pattern = ["a", "b", "c"]
        words = pattern * 333  # 999 elements
        codeflash_output = ngrams(words, 3)
        result = codeflash_output  # 544μs -> 163μs (234% faster)
        # Every third n-gram should be ("a", "b", "c")
        for i in range(0, len(result), 3):
            if i < len(result):
                pass

    def test_all_unique_elements_large(self):
        # Test with all unique elements in a large list
        words = [f"unique_{i}_{j}" for i in range(10) for j in range(100)]
        codeflash_output = ngrams(words, 5)
        result = codeflash_output  # 762μs -> 172μs (343% faster)

    def test_memory_efficiency_check(self):
        # Test that function doesn't create excessive intermediate structures
        # by verifying output size is proportional to input
        words = [f"mem{i}" for i in range(500)]
        codeflash_output = ngrams(words, 10)
        result = codeflash_output  # 607μs -> 95.3μs (537% faster)

    def test_boundary_conditions_large_list(self):
        # Test boundary conditions with large list
        words = [f"boundary{i}" for i in range(1000)]
        # n = 1 (minimum meaningful n)
        codeflash_output = ngrams(words, 1)
        result_1 = codeflash_output  # 372μs -> 158μs (135% faster)
        # n = 1000 (equals list length)
        codeflash_output = ngrams(words, 1000)
        result_1000 = codeflash_output  # 98.7μs -> 6.61μs (1394% faster)
        # n = 1001 (exceeds list length)
        codeflash_output = ngrams(words, 1001)
        result_1001 = codeflash_output  # 724ns -> 1.05μs (30.9% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from unstructured.utils import ngrams


def test_ngrams():
    ngrams([""], 1)
```

</details>

<details>
<summary>🔎 Click to see Concolic Coverage Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `codeflash_concolic_ph7c2wr0/tmphq_b3i1a/test_concolic_coverage.py::test_ngrams` | 292μs | 292μs | 0.101%✅ |

</details>

To edit these changes, run `git checkout codeflash/optimize-ngrams-mjuye5a2` and push.

Optimized with [Codeflash](https://codeflash.ai) ![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>

References

#4177 - ⚡️ Speed up function `ngrams` by 188%

Author

aseembits93

Parents

0581e3c1
