enhancement: Speed up function `detect_languages` by 5% (#4163)
<!-- CODEFLASH_OPTIMIZATION:
{"function":"detect_languages","file":"unstructured/partition/common/lang.py","speedup_pct":"5%","speedup_x":"0.05x","original_runtime":"133 milliseconds","best_runtime":"127 milliseconds","optimization_type":"general","timestamp":"2025-12-23T16:16:38.424Z","version":"1.0"}
-->
#### 📄 5% (0.05x) speedup for ***`detect_languages` in `unstructured/partition/common/lang.py`***
⏱️ Runtime : **`133 milliseconds`** **→** **`127 milliseconds`** (best
of `14` runs)
#### 📝 Explanation and details
The optimized code achieves a ~5% speedup through three targeted
performance improvements:
## Key Optimizations
### 1. **LRU Cache for ISO639 Language Lookups**
The `iso639.Language.match()` call is expensive, consuming ~29% of
`_get_iso639_language_object`'s time in the baseline. By wrapping it in
`@lru_cache(maxsize=256)`, repeated lookups of the same language codes
(common in real workloads) are served from cache instead of re-executing
the match logic. A cache hit cuts lookup time from ~25μs to
near-zero.
**Impact:** The line profiler shows `_get_iso639_language_object` time
dropping from 5.28ms to 4.34ms (18% faster). Test cases with repeated
language codes see 20-55% improvements (e.g.,
`test_large_languages_list`: 54.7% faster).
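The caching idea can be sketched in isolation as below. This is a minimal illustration, not the code from the PR: `_FAKE_ISO639` and `cached_language_lookup` are hypothetical stand-ins for the real `iso639.Language.match()` call, and `call_count` exists only to make cache hits visible.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical stand-in table; the real code delegates to iso639.Language.match().
_FAKE_ISO639 = {"en": "eng", "eng": "eng", "fr": "fra", "fra": "fra"}

call_count = 0  # counts how often the uncached lookup body actually runs


@lru_cache(maxsize=256)
def cached_language_lookup(code: str) -> Optional[str]:
    """Return a three-letter code, or None if the input is unknown."""
    global call_count
    call_count += 1
    return _FAKE_ISO639.get(code.lower())


# Five lookups, but only two distinct arguments.
results = [cached_language_lookup(c) for c in ["eng", "fr", "eng", "eng", "fr"]]
info = cached_language_lookup.cache_info()
```

Here the lookup body runs twice and the remaining three calls are cache hits, which `cache_info()` reports. Note that `lru_cache` keys on the exact argument, so `"eng"` and `"ENG"` would occupy separate entries, and it does not cache calls that raise.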
### 2. **Precompiled Regex Pattern**
The ASCII detection regex `r"^[\x00-\x7F]+$"` was compiled on every call
to `detect_languages()`. Moving it to module-level (`_ASCII_RE`)
eliminates repeated compilation overhead. Line profiler shows this path
dropping from 1.66ms to 945μs (~43% faster) when the regex is evaluated.
**Impact:** Short ASCII text test cases show 20-33% speedups (e.g.,
`test_short_ascii_text_defaults_to_english`: 28.5% faster).
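A minimal sketch of the module-level pattern follows. The pattern string and the `_ASCII_RE` name come from the description above; the `is_ascii_only` helper is a hypothetical wrapper added for illustration and is not claimed to exist in `lang.py`.

```python
import re

# Compiled once at import time; the pattern is the one quoted above.
_ASCII_RE = re.compile(r"^[\x00-\x7F]+$")


def is_ascii_only(text: str) -> bool:
    """True when text is non-empty and every character is 7-bit ASCII."""
    return _ASCII_RE.match(text) is not None


short_ascii = is_ascii_only("Hi!")
short_non_ascii = is_ascii_only("¡Hola!")
empty = is_ascii_only("")
```

Python's `re` module keeps an internal cache of recently compiled patterns, so part of the measured gain may come from skipping that per-call cache lookup rather than from avoiding a full recompile.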
### 3. **Set-Based Deduplication**
The original code checked `if lang not in doc_languages` using list
membership (O(n) per check). The optimized version maintains a parallel
`set` for O(1) membership checks while preserving list order for output.
This matters most when `langdetect_result` returns multiple languages.
**Impact:** Minimal overhead for typical cases (<5 languages), but
prevents O(n²) behavior for edge cases with many detected languages.
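The list-plus-set technique can be sketched as follows. The function name `dedupe_preserving_order` is hypothetical and chosen for illustration; the PR applies the same idea inline to `doc_languages`.

```python
from typing import Iterable, List


def dedupe_preserving_order(codes: Iterable[str]) -> List[str]:
    """Drop repeated codes while keeping first-seen order."""
    seen = set()  # O(1) average-case membership checks
    ordered = []  # preserves the order of first occurrence
    for code in codes:
        if code not in seen:
            seen.add(code)
            ordered.append(code)
    return ordered


unique_codes = dedupe_preserving_order(["zho", "eng", "zho", "fra", "eng"])
```

The set duplicates the list's contents, trading a small amount of memory for constant-time lookups, while the list remains the value that is returned.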
## Workload Context
Based on `function_references`, `detect_languages()` is called from
`apply_lang_metadata()`, which:
- Processes **batches of document elements** (potentially hundreds per
document)
- Calls `detect_languages()` once per element when
`detect_language_per_element=True` or per-document otherwise
This makes the optimizations highly effective because:
- **Cache benefits compound**: The same language codes (e.g., "eng",
"fra") are looked up repeatedly across elements
- **Regex precompilation scales**: Short text elements trigger the ASCII
check frequently
- **Batch processing amplifies gains**: Even a 5% per-call improvement
multiplies across document pipelines
## Test Case Patterns
- **User-supplied language tests** (20-55% faster): Benefit most from
cached ISO639 lookups since they bypass langdetect
- **Short ASCII text tests** (20-33% faster): Benefit from precompiled
regex
- **Auto-detection tests** (2-10% faster): Benefit from all
optimizations but are dominated by the slow `detect_langs()` library
call (99.5% of runtime), limiting overall gains
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **28 Passed** |
| 🌀 Generated Regression Tests | ✅ **64 Passed** |
| ⏪ Replay Tests | ✅ **1 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
|📊 Tests Coverage | 92.5% |
<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-------------------------|:------------|:-------------|:--------|
| `partition/common/test_lang.py::test_detect_languages_english_auto` | 1.07ms | 926μs | 15.4%✅ |
| `partition/common/test_lang.py::test_detect_languages_english_provided` | 8.99μs | 4.51μs | 99.4%✅ |
| `partition/common/test_lang.py::test_detect_languages_gets_multiple_languages` | 5.47ms | 5.04ms | 8.46%✅ |
| `partition/common/test_lang.py::test_detect_languages_handles_spelled_out_languages` | 10.0μs | 6.19μs | 61.6%✅ |
| `partition/common/test_lang.py::test_detect_languages_korean_auto` | 267μs | 239μs | 11.7%✅ |
| `partition/common/test_lang.py::test_detect_languages_raises_TypeError_for_invalid_languages` | 1.62μs | 1.57μs | 3.64%✅ |
| `partition/common/test_lang.py::test_detect_languages_warns_for_auto_and_other_input` | 1.57ms | 1.44ms | 8.99%✅ |
</details>
<details>
<summary>🌀 Click to see Generated Regression Tests</summary>
```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.common.lang import detect_languages


# Dummy logger for test isolation (since the real logger is not available)
class DummyLogger:
    def debug(self, msg):
        pass

    def warning(self, msg):
        pass


logger = DummyLogger()

# Minimal TESSERACT_LANGUAGES_AND_CODES for test coverage
TESSERACT_LANGUAGES_AND_CODES = {
    "eng": "eng",
    "en": "eng",
    "fra": "fra",
    "fre": "fra",
    "fr": "fra",
    "spa": "spa",
    "es": "spa",
    "deu": "deu",
    "de": "deu",
    "zho": "zho",
    "zh": "zho",
    "chi": "zho",
    "kor": "kor",
    "ko": "kor",
    "rus": "rus",
    "ru": "rus",
    "ita": "ita",
    "it": "ita",
    "jpn": "jpn",
    "ja": "jpn",
}
# unit tests
# Basic Test Cases
def test_english_detection_auto():
    # Should detect English for a simple English sentence
    text = "This is a simple English sentence."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.08ms -> 940μs (15.0% faster)


def test_french_detection_auto():
    # Should detect French for a simple French sentence
    text = "Ceci est une phrase en français."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.00ms -> 912μs (9.74% faster)


def test_spanish_detection_auto():
    # Should detect Spanish for a simple Spanish sentence
    text = "Esta es una oración en español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 777μs -> 714μs (8.77% faster)


def test_german_detection_auto():
    # Should detect German for a simple German sentence
    text = "Dies ist ein deutscher Satz."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 626μs -> 616μs (1.61% faster)


def test_chinese_detection_auto():
    # Should detect Chinese for a simple Chinese sentence
    text = "这是一个中文句子。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 771μs -> 722μs (6.87% faster)


def test_korean_detection_auto():
    # Should detect Korean for a simple Korean sentence
    text = "이것은 한국어 문장입니다."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 272μs -> 260μs (4.76% faster)


def test_russian_detection_auto():
    # Should detect Russian for a simple Russian sentence
    text = "Это русское предложение."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 863μs -> 827μs (4.34% faster)


def test_japanese_detection_auto():
    # Should detect Japanese for a simple Japanese sentence
    text = "これは日本語の文です。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 255μs -> 237μs (7.88% faster)


def test_user_supplied_languages():
    # Should return the user-supplied language codes in ISO 639-2/B format
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng"])
    result = codeflash_output  # 5.01μs -> 4.08μs (22.8% faster)


def test_user_supplied_multiple_languages():
    # Should return all valid user-supplied language codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng", "fra", "spa"])
    result = codeflash_output  # 3.74μs -> 3.18μs (17.8% faster)


def test_user_supplied_language_aliases():
    # Should convert aliases to ISO 639-2/B codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["en", "fr", "es"])
    result = codeflash_output  # 3.51μs -> 2.89μs (21.6% faster)


def test_user_supplied_language_mixed_case():
    # Should handle mixed-case language codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["EnG", "FrA"])
    result = codeflash_output  # 3.43μs -> 2.86μs (19.8% faster)


def test_auto_overrides_user_supplied():
    # Should ignore user-supplied languages if "auto" is present
    text = "Ceci est une phrase en français."
    codeflash_output = detect_languages(text, ["auto", "eng"])
    result = codeflash_output  # 1.78ms -> 1.65ms (8.18% faster)


def test_none_languages_defaults_to_auto():
    # Should default to auto if languages=None
    text = "Dies ist ein deutscher Satz."
    codeflash_output = detect_languages(text, None)
    result = codeflash_output  # 619μs -> 583μs (6.12% faster)


def test_short_ascii_text_defaults_to_english():
    # Should default to English for short ASCII text
    text = "Hi!"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 5.71μs -> 4.45μs (28.5% faster)


def test_short_ascii_text_with_spaces_defaults_to_english():
    # Should default to English for short ASCII text with spaces
    text = "Hi there"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 4.05μs -> 3.31μs (22.4% faster)


# Edge Test Cases
def test_empty_text_returns_none():
    # Should return None for empty text
    codeflash_output = detect_languages("")  # 751ns -> 747ns (0.535% faster)


def test_whitespace_text_returns_none():
    # Should return None for whitespace-only text
    codeflash_output = detect_languages(" ")  # 754ns -> 726ns (3.86% faster)


def test_languages_first_element_empty_string_returns_none():
    # Should return None if languages[0] == ""
    text = "Some text"
    codeflash_output = detect_languages(text, [""])  # 540ns -> 544ns (0.735% slower)


def test_non_list_languages_raises_type_error():
    # Should raise TypeError if languages is not a list
    with pytest.raises(TypeError):
        detect_languages("Some text", "eng")  # 1.20μs -> 1.23μs (2.20% slower)


def test_invalid_language_code_ignored():
    # Should ignore invalid language codes in user-supplied list
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng", "invalid_code"])
    result = codeflash_output  # 4.13μs -> 3.45μs (19.8% faster)


def test_only_invalid_language_codes_returns_empty_list():
    # Should return empty list if all user-supplied codes are invalid
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["invalid1", "invalid2"])
    result = codeflash_output  # 3.93μs -> 2.91μs (35.0% faster)


def test_text_with_special_characters():
    # Should not default to English if text has special characters
    text = "niño año jalapeño"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 705μs -> 626μs (12.7% faster)


def test_text_with_multiple_languages():
    # Should detect multiple languages in text (order may vary)
    text = "This is English. Ceci est français. Esto es español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 2.65ms -> 2.41ms (10.3% faster)


def test_text_with_chinese_variants_normalizes_to_zho():
    # Should normalize all Chinese variants to "zho"
    text = "这是中文。這是中文。這是中國話。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 454μs -> 426μs (6.63% faster)


def test_text_with_unsupported_language_returns_none():
    # Should return None for gibberish text (langdetect fails)
    text = "asdfqwerzxcv"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 4.67μs -> 3.77μs (23.8% faster)


def test_text_with_numbers_and_symbols():
    # Should default to English for short ASCII text with numbers/symbols
    text = "1234!?"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 3.81μs -> 2.87μs (32.8% faster)


def test_text_with_long_ascii_non_english():
    # Should not default to English for long ASCII text that is not English
    text = "Ceci est une phrase en francais sans accents mais en francais"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.36ms -> 1.27ms (6.90% faster)


def test_text_with_newlines_and_tabs():
    # Should handle text with newlines and tabs
    text = "This is English.\nCeci est français.\tEsto es español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 2.48ms -> 2.29ms (8.09% faster)


# Large Scale Test Cases
def test_large_text_english():
    # Should detect English in a large English text
    text = " ".join(["This is a sentence."] * 500)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.37ms -> 8.16ms (2.51% faster)


def test_large_text_french():
    # Should detect French in a large French text
    text = " ".join(["Ceci est une phrase."] * 500)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.60ms -> 9.12ms (5.21% faster)


def test_large_text_mixed_languages():
    # Should detect multiple languages in a large mixed-language text
    text = ("This is English. " * 300) + ("Ceci est français. " * 300) + ("Esto es español. " * 300)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.71ms -> 9.30ms (4.33% faster)


def test_large_user_supplied_languages():
    # Should handle a large list of user-supplied languages (but only valid ones returned)
    text = "Does not matter."
    languages = ["eng"] * 500 + ["fra"] * 400 + ["invalid"] * 50
    codeflash_output = detect_languages(text, languages)
    result = codeflash_output  # 6.49μs -> 4.51μs (44.0% faster)


def test_large_text_with_special_characters():
    # Should detect Spanish in a large text with special characters
    text = "niño año jalapeño " * 500
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.76ms -> 8.27ms (5.91% faster)


def test_large_text_with_chinese_and_english():
    # Should detect both Chinese and English in a large mixed text
    text = ("This is English. " * 400) + ("这是中文。 " * 400)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.67ms -> 9.38ms (3.15% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from langdetect import lang_detect_exception
from unstructured.partition.common.lang import detect_languages


# unit tests
# Basic Test Cases
def test_detect_languages_english_auto():
    # Basic: English text, auto detection
    codeflash_output = detect_languages("This is a simple English sentence.")
    result = codeflash_output  # 1.18ms -> 1.11ms (6.25% faster)


def test_detect_languages_french_auto():
    # Basic: French text, auto detection
    codeflash_output = detect_languages("Ceci est une phrase française simple.")
    result = codeflash_output  # 1.25ms -> 1.15ms (8.97% faster)


def test_detect_languages_spanish_auto():
    # Basic: Spanish text, auto detection
    codeflash_output = detect_languages("Esta es una oración en español.")
    result = codeflash_output  # 794μs -> 742μs (7.04% faster)


def test_detect_languages_user_input_single():
    # Basic: User provides a single valid language code
    codeflash_output = detect_languages("Some text", ["eng"])
    result = codeflash_output  # 6.02μs -> 4.75μs (26.8% faster)


def test_detect_languages_user_input_multiple():
    # Basic: User provides multiple valid language codes
    codeflash_output = detect_languages("Some text", ["eng", "fra"])
    result = codeflash_output  # 3.68μs -> 2.83μs (29.8% faster)


def test_detect_languages_user_input_nonstandard_code():
    # Basic: User provides a nonstandard but mapped language code
    # e.g. "en" maps to "eng" via iso639
    codeflash_output = detect_languages("Some text", ["en"])
    result = codeflash_output  # 3.68μs -> 2.80μs (31.3% faster)


def test_detect_languages_auto_overrides_user_input():
    # Basic: "auto" in languages overrides user input
    codeflash_output = detect_languages("Ceci est une phrase française simple.", ["auto", "eng"])
    result = codeflash_output  # 2.05ms -> 1.90ms (7.49% faster)


def test_detect_languages_short_ascii_text_defaults_to_english():
    # Basic: Short ASCII text should default to English
    codeflash_output = detect_languages("Hi!")
    result = codeflash_output  # 5.07μs -> 4.20μs (20.7% faster)


def test_detect_languages_short_non_ascii_text():
    # Basic: Short non-ASCII text should not default to English
    codeflash_output = detect_languages("¡Hola!")
    result = codeflash_output  # 3.21ms -> 2.94ms (9.05% faster)


# Edge Test Cases
def test_detect_languages_empty_text_returns_none():
    # Edge: Empty string should return None
    codeflash_output = detect_languages("")
    result = codeflash_output  # 759ns -> 750ns (1.20% faster)


def test_detect_languages_whitespace_text_returns_none():
    # Edge: Whitespace only should return None
    codeflash_output = detect_languages(" \n\t ")
    result = codeflash_output  # 932ns -> 808ns (15.3% faster)


def test_detect_languages_languages_empty_string_returns_none():
    # Edge: languages[0] == "" should return None
    codeflash_output = detect_languages("Some text", [""])
    result = codeflash_output  # 538ns -> 517ns (4.06% faster)


def test_detect_languages_languages_none_defaults_to_auto():
    # Edge: languages=None should act like ["auto"]
    codeflash_output = detect_languages("Bonjour tout le monde", None)
    result = codeflash_output  # 4.49μs -> 3.66μs (22.7% faster)


def test_detect_languages_invalid_languages_type_raises():
    # Edge: languages is not a list, should raise TypeError
    with pytest.raises(TypeError):
        detect_languages("Some text", "eng")  # 1.32μs -> 1.24μs (6.64% faster)


def test_detect_languages_invalid_language_code_skipped():
    # Edge: User provides an invalid code, should skip it
    codeflash_output = detect_languages("Some text", ["eng", "notacode"])
    result = codeflash_output  # 3.87μs -> 3.01μs (28.7% faster)


def test_detect_languages_mixed_valid_invalid_codes():
    # Edge: User provides mixed valid/invalid codes
    codeflash_output = detect_languages("Some text", ["eng", "fra", "badcode"])
    result = codeflash_output  # 3.60μs -> 2.79μs (29.0% faster)


def test_detect_languages_detect_langs_exception_returns_none(monkeypatch):
    # Edge: langdetect raises exception, should return None
    def raise_exception(text):
        raise lang_detect_exception.LangDetectException("No features in text.")

    monkeypatch.setattr("langdetect.detect_langs", raise_exception)
    codeflash_output = detect_languages("This will error out.")
    result = codeflash_output  # 3.63μs -> 3.12μs (16.3% faster)


def test_detect_languages_chinese_variant_normalization():
    # Edge: Chinese variants normalized to "zho"
    # "你好,世界" is Chinese
    codeflash_output = detect_languages("你好,世界")
    result = codeflash_output  # 2.06ms -> 1.92ms (7.65% faster)


def test_detect_languages_multiple_languages_in_text():
    # Edge: Mixed language text
    text = "Hello world. Bonjour le monde. Hola mundo."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 3.92ms -> 3.70ms (5.93% faster)


def test_detect_languages_duplicate_chinese_not_repeated():
    # Edge: Multiple Chinese variants should not duplicate "zho"
    # Simulate langdetect returning zh-cn and zh-tw
    class DummyLangObj:
        def __init__(self, lang):
            self.lang = lang

    def fake_detect_langs(text):
        return [DummyLangObj("zh-cn"), DummyLangObj("zh-tw")]

    import langdetect

    monkeypatch = pytest.MonkeyPatch()
    monkeypatch.setattr(langdetect, "detect_langs", fake_detect_langs)
    codeflash_output = detect_languages("中文文本")
    result = codeflash_output  # 1.00ms -> 928μs (7.89% faster)
    monkeypatch.undo()


def test_detect_languages_non_ascii_short_text_not_default_eng():
    # Edge: Short non-ascii text should not default to English
    codeflash_output = detect_languages("你好")
    result = codeflash_output  # 1.37ms -> 1.26ms (8.34% faster)


def test_detect_languages_tesseract_code_mapping():
    # Edge: TESSERACT_LANGUAGES_AND_CODES mapping
    # For example, "chi_sim" should map to "zho"
    codeflash_output = detect_languages("Some text", ["chi_sim"])
    result = codeflash_output  # 4.56μs -> 3.45μs (32.0% faster)


# Large Scale Test Cases
def test_detect_languages_large_text_english():
    # Large: Large English text
    text = "This is a sentence. " * 500  # 500 sentences
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.32ms -> 8.13ms (2.36% faster)


def test_detect_languages_large_text_french():
    # Large: Large French text
    text = "Ceci est une phrase. " * 500
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.50ms -> 9.12ms (4.18% faster)


def test_detect_languages_large_text_mixed():
    # Large: Large mixed language text
    text = (
        "This is an English sentence. " * 333
        + "Ceci est une phrase française. " * 333
        + "Esta es una oración en español. " * 333
    )
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.10ms -> 8.79ms (3.48% faster)


def test_detect_languages_large_languages_list():
    # Large: User provides a large list of valid codes
    codes = ["eng", "fra", "spa", "deu", "ita", "por", "rus", "zho"] * 10  # 80 codes
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 6.75μs -> 4.37μs (54.7% faster)
    # Should contain all unique codes in iso639-3 form
    expected = ["eng", "fra", "spa", "deu", "ita", "por", "rus", "zho"]


def test_detect_languages_large_invalid_codes():
    # Large: User provides a large list of invalid codes
    codes = ["badcode" + str(i) for i in range(100)]
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 3.57μs -> 3.08μs (16.2% faster)


def test_detect_languages_performance_large_input():
    # Large: Performance with large input (under 1000 elements)
    text = "Hello world! " * 999
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 14.5ms -> 13.7ms (5.79% faster)


def test_detect_languages_performance_large_languages_list():
    # Large: Performance with large languages list (under 1000 elements)
    codes = ["eng"] * 999
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 6.01μs -> 3.87μs (55.5% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
<details>
<summary>⏪ Click to see Replay Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-------------------------|:------------|:-------------|:--------|
| `test_benchmark5_py__replay_test_0.py::test_unstructured_partition_common_lang_detect_languages` | 4.94ms | 4.78ms | 3.27%✅ |
</details>
To edit these changes, run `git checkout codeflash/optimize-detect_languages-mjisezcy` and push.

---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>