enhancement: Speed up function `ngrams` by 188% (#4177)
<!-- CODEFLASH_OPTIMIZATION: {"function":"ngrams","file":"unstructured/utils.py","speedup_pct":"188%","speedup_x":"1.88x","original_runtime":"6.12 milliseconds","best_runtime":"2.13 milliseconds","optimization_type":"loop","timestamp":"2026-01-01T04:37:11.183Z","version":"1.0"} -->
#### 📄 188% (1.88x) speedup for ***`ngrams` in `unstructured/utils.py`***
⏱️ Runtime: **`6.12 milliseconds`** **→** **`2.13 milliseconds`** (best of `138` runs)
#### 📝 Explanation and details
The optimized code achieves a **187% speedup** by replacing the nested loops with list slicing inside a list comprehension. Here's why it's faster:
## Key Optimizations
**1. List Comprehension vs Nested Loops**
- **Original**: Uses nested loops that append one element at a time (`ngram.append(s[i + j])`), creating and growing a temporary list `ngram` for each n-gram before converting it to a tuple
- **Optimized**: Uses list slicing (`s[i:i+n]`), which is implemented in C and builds each subsequence in a single operation
**2. Eliminated Redundant Operations**
The line profiler shows that the original code spends:
- 35% of its time iterating the inner loop (`for j in range(n)`)
- 37% appending elements (`ngram.append(s[i + j])`)
- 12.5% converting lists to tuples (`tuple(ngram)`)
The optimized version removes all of this overhead by extracting the slice and converting it to a tuple in a single expression.
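For concreteness, here is a minimal sketch of the two approaches described above. It is reconstructed from this summary rather than copied from `unstructured/utils.py`, so names and details may differ from the actual implementation.
```python
from typing import Any, List, Tuple


def ngrams_nested_loops(s: List[Any], n: int) -> List[Tuple[Any, ...]]:
    """Original approach: build each n-gram by appending one element at a time."""
    ngrams_list = []
    for i in range(len(s) - n + 1):
        ngram = []
        for j in range(n):
            ngram.append(s[i + j])  # one interpreted append per element
        ngrams_list.append(tuple(ngram))  # extra list-to-tuple conversion
    return ngrams_list


def ngrams_sliced(s: List[Any], n: int) -> List[Tuple[Any, ...]]:
    """Optimized approach: one C-level slice per n-gram inside a comprehension."""
    if n <= 0:
        # Mirror the original behavior for n <= 0: one empty tuple per position.
        return [() for _ in range(len(s) - n + 1)]
    return [tuple(s[i : i + n]) for i in range(len(s) - n + 1)]
```
Both produce, for example, `[('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]` from `['the', 'quick', 'brown', 'fox']` with `n=2`.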
## Performance Impact by Context
The function is called in `calculate_shared_ngram_percentage()`, which operates on text split into word lists and is likely used for text-similarity analysis. The optimization particularly benefits:
- **Large n-grams**: When `n` is large (e.g., `n=1000`), the speedup reaches **1394%**, because the original code's interpreted inner-loop overhead grows with `n`, while the slice copies the same elements at C speed
- **Many n-grams**: For lists of 1000 elements with `n=2-3`, the speedup is **181-234%**, because the outer loop runs many times
- **Hot paths**: Since this function feeds text-similarity calculations, it is likely called frequently on document chunks, so even the 5-20% gains on small inputs add up
## Edge Case Handling
The optimized code adds explicit handling for `n <= 0`:
- Returns empty tuples for each position when `n <= 0`, matching the
original behavior where `range(n)` with negative `n` produces no
iterations
- This is 7-9% faster for edge cases while maintaining correctness
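A short illustration of these edge cases, assuming the behavior described above (the `n <= 0` result is inferred from this explanation rather than from an asserted test):
```python
from unstructured.utils import ngrams

assert ngrams([], 2) == []          # empty input: no n-grams
assert ngrams(["a", "b"], 5) == []  # n greater than len(s): no n-grams

# Per the explanation above, n <= 0 yields one empty tuple per position, e.g.
# ngrams(["a", "b", "c"], 0) -> [(), (), (), ()]
```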
## Test Results Summary
- **Small inputs** (3-10 elements): 5-40% faster
- **Medium inputs** (100-500 elements): 132-354% faster
- **Large inputs** (1000 elements): 181-1394% faster depending on `n`
- **Edge cases** (empty lists, `n > len`): Some are 25-30% slower due to the fixed setup cost of the list comprehension, but these are rare cases with negligible absolute time impact (<3μs)
The optimization trades slightly slower edge case performance for
dramatically better typical case performance, which is the right
tradeoff given the function's usage pattern in text processing.
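For anyone who wants to sanity-check these numbers locally, a minimal timing sketch follows; the input size and run count are illustrative and not the harness that produced the figures above.
```python
import timeit

from unstructured.utils import ngrams

words = [str(i) for i in range(1000)]  # illustrative input size

for n in (2, 3, 1000):
    per_call = timeit.timeit(lambda: ngrams(words, n), number=200) / 200
    print(f"n={n}: {per_call * 1e6:.1f} us per call")
```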
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **58 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | ✅ **1 Passed** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>🌀 Click to see Generated Regression Tests</summary>
```python
from __future__ import annotations
# imports
from unstructured.utils import ngrams
# unit tests
# -------------------- BASIC TEST CASES --------------------
def test_ngrams_basic_unigram():
# Test with n=1 (unigram)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, 1)
result = codeflash_output # 4.39μs -> 4.17μs (5.30% faster)
def test_ngrams_basic_bigram():
# Test with n=2 (bigram)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.29μs -> 4.00μs (7.46% faster)
def test_ngrams_basic_trigram():
# Test with n=3 (trigram)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, 3)
result = codeflash_output # 3.75μs -> 3.77μs (0.531% slower)
def test_ngrams_basic_typical_sentence():
# Test with a typical sentence split into words
s = ["the", "quick", "brown", "fox", "jumps"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 5.16μs -> 4.33μs (19.1% faster)
def test_ngrams_basic_full_ngram():
# Test where n equals the length of the list
s = ["a", "b", "c", "d"]
codeflash_output = ngrams(s, 4)
result = codeflash_output # 3.88μs -> 3.81μs (1.63% faster)
# -------------------- EDGE TEST CASES --------------------
def test_ngrams_empty_list():
# Test with an empty list
s = []
codeflash_output = ngrams(s, 2)
result = codeflash_output # 1.98μs -> 2.82μs (29.9% slower)
def test_ngrams_n_zero():
# Test with n=0, should return empty list (no 0-grams)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, 0)
result = codeflash_output # 3.93μs -> 3.67μs (7.20% faster)
def test_ngrams_n_negative():
# Test with negative n, should return empty list (no negative n-grams)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, -1)
result = codeflash_output # 4.16μs -> 3.84μs (8.27% faster)
def test_ngrams_n_greater_than_len():
# Test with n greater than the length of the list
s = ["a", "b"]
codeflash_output = ngrams(s, 3)
result = codeflash_output # 2.03μs -> 2.80μs (27.4% slower)
def test_ngrams_n_equals_zero_and_empty_list():
# Test with n=0 and empty list
s = []
codeflash_output = ngrams(s, 0)
result = codeflash_output # 3.18μs -> 3.49μs (8.90% slower)
def test_ngrams_list_of_length_one():
# Test with a single element list and n=1
s = ["a"]
codeflash_output = ngrams(s, 1)
result = codeflash_output # 3.47μs -> 3.79μs (8.57% slower)
def test_ngrams_list_of_length_one_n_greater():
# Test with a single element list and n>1
s = ["a"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 2.08μs -> 2.76μs (24.5% slower)
def test_ngrams_non_ascii_characters():
# Test with non-ASCII and unicode characters
s = ["你好", "世界", "😊"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.23μs -> 4.02μs (5.07% faster)
def test_ngrams_repeated_elements():
# Test with repeated elements in the list
s = ["a", "a", "a", "a"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.78μs -> 4.26μs (12.3% faster)
def test_ngrams_with_empty_strings():
# Test with empty strings as elements
s = ["", "a", ""]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.24μs -> 4.04μs (4.85% faster)
def test_ngrams_with_mixed_types_raises():
# Test with non-string elements should raise TypeError in type-checked code, but function as written does not check
s = ["a", 1, None]
# The function will not error, but let's check that output matches tuple of elements
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.30μs -> 4.07μs (5.66% faster)
def test_ngrams_large_n_and_empty_list():
# Test with very large n and empty list
s = []
codeflash_output = ngrams(s, 100)
result = codeflash_output # 2.22μs -> 2.94μs (24.5% slower)
# -------------------- LARGE SCALE TEST CASES --------------------
def test_ngrams_large_input_unigram():
# Test with a large list and n=1 (should return all elements as singletons)
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 1)
result = codeflash_output # 372μs -> 157μs (136% faster)
def test_ngrams_large_input_bigram():
# Test with a large list and n=2 (should return len(s)-1 bigrams)
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 457μs -> 162μs (181% faster)
def test_ngrams_large_input_trigram():
# Test with a large list and n=3
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 3)
result = codeflash_output # 541μs -> 162μs (234% faster)
def test_ngrams_large_input_n_equals_length():
# Test with a large list and n equals the list length
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 1000)
result = codeflash_output # 99.9μs -> 8.80μs (1035% faster)
def test_ngrams_large_input_n_greater_than_length():
# Test with a large list and n greater than the list length
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 1001)
result = codeflash_output # 1.71μs -> 2.42μs (29.6% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
# imports
import pytest # used for our unit tests
from unstructured.utils import ngrams
# unit tests
class TestNgramsBasic:
"""Basic test cases for normal operating conditions"""
def test_bigrams_simple_sentence(self):
# Test generating bigrams (n=2) from a simple sentence
words = ["the", "quick", "brown", "fox"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.63μs -> 4.23μs (9.43% faster)
expected = [("the", "quick"), ("quick", "brown"), ("brown", "fox")]
def test_trigrams_simple_sentence(self):
# Test generating trigrams (n=3) from a simple sentence
words = ["I", "love", "to", "code"]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 4.37μs -> 4.03μs (8.36% faster)
expected = [("I", "love", "to"), ("love", "to", "code")]
def test_unigrams(self):
# Test generating unigrams (n=1), should return each word as a single-element tuple
words = ["hello", "world"]
codeflash_output = ngrams(words, 1)
result = codeflash_output # 4.04μs -> 4.00μs (1.000% faster)
expected = [("hello",), ("world",)]
def test_fourgrams(self):
# Test generating 4-grams from a longer sequence
words = ["a", "b", "c", "d", "e", "f"]
codeflash_output = ngrams(words, 4)
result = codeflash_output # 5.29μs -> 4.26μs (24.0% faster)
expected = [("a", "b", "c", "d"), ("b", "c", "d", "e"), ("c", "d", "e", "f")]
def test_single_word_list_unigram(self):
# Test with a single word and n=1
words = ["hello"]
codeflash_output = ngrams(words, 1)
result = codeflash_output # 3.31μs -> 3.75μs (11.7% slower)
expected = [("hello",)]
def test_exact_length_match(self):
# Test when n equals the length of the list (should return one n-gram)
words = ["one", "two", "three"]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 3.64μs -> 3.74μs (2.65% slower)
expected = [("one", "two", "three")]
def test_numeric_strings(self):
# Test with numeric strings to ensure type handling
words = ["1", "2", "3", "4", "5"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 5.03μs -> 4.26μs (17.9% faster)
expected = [("1", "2"), ("2", "3"), ("3", "4"), ("4", "5")]
def test_special_characters(self):
# Test with special characters and punctuation
words = ["Hello", ",", "world", "!", "How", "are", "you", "?"]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 6.60μs -> 4.74μs (39.3% faster)
expected = [
("Hello", ",", "world"),
(",", "world", "!"),
("world", "!", "How"),
("!", "How", "are"),
("How", "are", "you"),
("are", "you", "?"),
]
class TestNgramsEdgeCases:
"""Edge cases and unusual conditions"""
def test_empty_list(self):
# Test with an empty list, should return empty list
words = []
codeflash_output = ngrams(words, 2)
result = codeflash_output # 1.91μs -> 2.74μs (30.3% slower)
expected = []
def test_n_greater_than_list_length(self):
# Test when n is greater than the list length, should return empty list
words = ["one", "two"]
codeflash_output = ngrams(words, 5)
result = codeflash_output # 1.94μs -> 2.76μs (29.6% slower)
expected = []
def test_n_equals_zero(self):
# Test with n=0, should return empty list (no 0-grams possible)
words = ["a", "b", "c"]
codeflash_output = ngrams(words, 0)
result = codeflash_output # 3.82μs -> 3.51μs (8.68% faster)
expected = []
def test_n_negative(self):
# Test with negative n, should return empty list
words = ["a", "b", "c"]
codeflash_output = ngrams(words, -1)
result = codeflash_output # 3.99μs -> 3.65μs (9.31% faster)
expected = []
def test_very_large_n(self):
# Test with very large n value, much greater than list length
words = ["a", "b"]
codeflash_output = ngrams(words, 1000)
result = codeflash_output # 2.09μs -> 2.80μs (25.5% slower)
expected = []
def test_empty_strings_in_list(self):
# Test with empty strings as elements
words = ["", "hello", "", "world", ""]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 5.21μs -> 4.36μs (19.6% faster)
expected = [("", "hello"), ("hello", ""), ("", "world"), ("world", "")]
def test_whitespace_strings(self):
# Test with whitespace-only strings
words = [" ", " ", " ", " "]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.68μs -> 4.19μs (11.6% faster)
expected = [(" ", " "), (" ", " "), (" ", " ")]
def test_duplicate_consecutive_words(self):
# Test with duplicate consecutive words
words = ["the", "the", "the", "end"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.73μs -> 4.21μs (12.3% faster)
expected = [("the", "the"), ("the", "the"), ("the", "end")]
def test_unicode_characters(self):
# Test with unicode characters
words = ["hello", "世界", "🌍", "مرحبا"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.69μs -> 4.19μs (11.9% faster)
expected = [("hello", "世界"), ("世界", "🌍"), ("🌍", "مرحبا")]
def test_very_long_strings(self):
# Test with very long individual strings
long_string = "a" * 10000
words = [long_string, "short", long_string]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.16μs -> 4.08μs (2.04% faster)
expected = [(long_string, "short"), ("short", long_string)]
def test_single_element_list_bigram(self):
# Test with single element list and n=2, should return empty
words = ["alone"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 1.94μs -> 2.79μs (30.5% slower)
expected = []
def test_two_elements_trigram(self):
# Test with two elements and n=3, should return empty
words = ["one", "two"]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 1.91μs -> 2.79μs (31.5% slower)
expected = []
def test_result_is_list_of_tuples(self):
# Verify the result is a list and contains tuples
words = ["a", "b", "c"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.16μs -> 4.09μs (1.76% faster)
def test_tuples_are_immutable(self):
# Verify that returned tuples are truly tuples (immutable)
words = ["x", "y", "z"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.19μs -> 3.94μs (6.43% faster)
# Try to modify a tuple (should raise TypeError)
with pytest.raises(TypeError):
result[0][0] = "modified"
def test_original_list_unchanged(self):
# Verify the original list is not modified
words = ["a", "b", "c", "d"]
original_copy = words.copy()
ngrams(words, 2) # 4.68μs -> 4.12μs (13.7% faster)
def test_mixed_case_sensitivity(self):
# Test that function preserves case
words = ["Hello", "WORLD", "hello", "world"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.60μs -> 4.16μs (10.6% faster)
expected = [("Hello", "WORLD"), ("WORLD", "hello"), ("hello", "world")]
class TestNgramsLargeScale:
"""Large scale tests for performance and scalability"""
def test_large_list_bigrams(self):
# Test with a large list (1000 elements) generating bigrams
words = [f"word{i}" for i in range(1000)]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 456μs -> 162μs (181% faster)
def test_large_list_small_n(self):
# Test with large list and small n value
words = [f"token{i}" for i in range(500)]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 267μs -> 81.6μs (228% faster)
def test_large_list_large_n(self):
# Test with large list and large n value
words = [f"item{i}" for i in range(100)]
codeflash_output = ngrams(words, 50)
result = codeflash_output # 235μs -> 19.5μs (1108% faster)
def test_large_n_value_unigrams(self):
# Test with large list generating unigrams (should be fast)
words = [f"element{i}" for i in range(1000)]
codeflash_output = ngrams(words, 1)
result = codeflash_output # 372μs -> 160μs (132% faster)
def test_maximum_size_ngram(self):
# Test generating an n-gram that spans almost the entire list
words = [f"w{i}" for i in range(100)]
codeflash_output = ngrams(words, 99)
result = codeflash_output # 21.2μs -> 4.66μs (354% faster)
def test_many_small_ngrams(self):
# Test generating many small n-grams from a large list
words = [chr(65 + (i % 26)) for i in range(1000)] # A-Z repeated
codeflash_output = ngrams(words, 2)
result = codeflash_output # 454μs -> 160μs (183% faster)
# Verify structure is maintained
for i, ngram in enumerate(result):
pass
def test_repeated_pattern_large_scale(self):
# Test with repeated pattern in large list
pattern = ["a", "b", "c"]
words = pattern * 333 # 999 elements
codeflash_output = ngrams(words, 3)
result = codeflash_output # 544μs -> 163μs (234% faster)
# Every third n-gram should be ("a", "b", "c")
for i in range(0, len(result), 3):
if i < len(result):
pass
def test_all_unique_elements_large(self):
# Test with all unique elements in a large list
words = [f"unique_{i}_{j}" for i in range(10) for j in range(100)]
codeflash_output = ngrams(words, 5)
result = codeflash_output # 762μs -> 172μs (343% faster)
def test_memory_efficiency_check(self):
# Test that function doesn't create excessive intermediate structures
# by verifying output size is proportional to input
words = [f"mem{i}" for i in range(500)]
codeflash_output = ngrams(words, 10)
result = codeflash_output # 607μs -> 95.3μs (537% faster)
def test_boundary_conditions_large_list(self):
# Test boundary conditions with large list
words = [f"boundary{i}" for i in range(1000)]
# n = 1 (minimum meaningful n)
codeflash_output = ngrams(words, 1)
result_1 = codeflash_output # 372μs -> 158μs (135% faster)
# n = 1000 (equals list length)
codeflash_output = ngrams(words, 1000)
result_1000 = codeflash_output # 98.7μs -> 6.61μs (1394% faster)
# n = 1001 (exceeds list length)
codeflash_output = ngrams(words, 1001)
result_1001 = codeflash_output # 724ns -> 1.05μs (30.9% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
from unstructured.utils import ngrams
def test_ngrams():
ngrams([""], 1)
```
</details>
<details>
<summary>🔎 Click to see Concolic Coverage Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-------------------------|:------------|:-------------|:--------|
| `codeflash_concolic_ph7c2wr0/tmphq_b3i1a/test_concolic_coverage.py::test_ngrams` | 292μs | 292μs | 0.101% ✅ |
</details>
To edit these changes, run `git checkout codeflash/optimize-ngrams-mjuye5a2` and push.
[](https://codeflash.ai)
---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>