unstructured
d54cc18b - enhancement: Speed up function `detect_languages` by 5% (#4163)

Commit
12 days ago
enhancement: Speed up function `detect_languages` by 5% (#4163)

<!-- CODEFLASH_OPTIMIZATION: {"function":"detect_languages","file":"unstructured/partition/common/lang.py","speedup_pct":"5%","speedup_x":"0.05x","original_runtime":"133 milliseconds","best_runtime":"127 milliseconds","optimization_type":"general","timestamp":"2025-12-23T16:16:38.424Z","version":"1.0"} -->

#### 📄 5% (0.05x) speedup for ***`detect_languages` in `unstructured/partition/common/lang.py`***

⏱️ Runtime : **`133 milliseconds`** **→** **`127 milliseconds`** (best of `14` runs)

#### 📝 Explanation and details

The optimized code achieves a ~5% speedup through three targeted performance improvements:

## Key Optimizations

### 1. **LRU Cache for ISO639 Language Lookups**

The `iso639.Language.match()` call is expensive, consuming ~29% of `_get_iso639_language_object`'s time in the baseline. Wrapping it in `@lru_cache(maxsize=256)` means repeated lookups of the same language codes (common in real workloads) are served from the cache instead of re-executing the match logic. A cache hit reduces lookup time from ~25μs to near zero.

**Impact:** The line profiler shows `_get_iso639_language_object` time dropping from 5.28ms to 4.34ms (about 18% less time). Test cases with repeated language codes see 20-55% improvements (e.g., `test_large_languages_list`: 54.7% faster).

### 2. **Precompiled Regex Pattern**

The ASCII detection regex `r"^[\x00-\x7F]+$"` was compiled on every call to `detect_languages()`. Moving it to module level (`_ASCII_RE`) eliminates the repeated compilation overhead. The line profiler shows this path dropping from 1.66ms to 945μs (~43% less time) when the regex is evaluated.

**Impact:** Short ASCII text test cases show 20-33% speedups (e.g., `test_short_ascii_text_defaults_to_english`: 28.5% faster).

### 3. **Set-Based Deduplication**

The original code checked `if lang not in doc_languages` using list membership (O(n) per check). The optimized version maintains a parallel `set` for O(1) membership checks while preserving list order for output. This matters when `langdetect_result` returns multiple languages.

**Impact:** Minimal overhead for typical cases (<5 languages), but prevents O(n²) behavior for edge cases with many detected languages.

## Workload Context

Based on `function_references`, `detect_languages()` is called from `apply_lang_metadata()`, which:

- Processes **batches of document elements** (potentially hundreds per document)
- Calls `detect_languages()` once per element when `detect_language_per_element=True`, or once per document otherwise

This makes the optimizations highly effective because:

- **Cache benefits compound**: The same language codes (e.g., "eng", "fra") are looked up repeatedly across elements
- **Regex precompilation scales**: Short text elements trigger the ASCII check frequently
- **Batch processing amplifies gains**: Even a 5% per-call improvement multiplies across document pipelines

## Test Case Patterns

- **User-supplied language tests** (20-55% faster): Benefit most from cached ISO639 lookups since they bypass langdetect
- **Short ASCII text tests** (20-33% faster): Benefit from the precompiled regex
- **Auto-detection tests** (2-10% faster): Benefit from all optimizations but are dominated by the slow `detect_langs()` library call (99.5% of runtime), limiting overall gains

✅ **Correctness verification report:**

| Test                          | Status            |
| ----------------------------- | ----------------- |
| ⚙️ Existing Unit Tests        | ✅ **28 Passed**  |
| 🌀 Generated Regression Tests | ✅ **64 Passed**  |
| ⏪ Replay Tests               | ✅ **1 Passed**   |
| 🔎 Concolic Coverage Tests    | 🔘 **None Found** |
| 📊 Tests Coverage             | 92.5%             |

<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-------------------------|:------------|:-------------|:--------|
| `partition/common/test_lang.py::test_detect_languages_english_auto` | 1.07ms | 926μs | 15.4%✅ |
| `partition/common/test_lang.py::test_detect_languages_english_provided` | 8.99μs | 4.51μs | 99.4%✅ |
| `partition/common/test_lang.py::test_detect_languages_gets_multiple_languages` | 5.47ms | 5.04ms | 8.46%✅ |
| `partition/common/test_lang.py::test_detect_languages_handles_spelled_out_languages` | 10.0μs | 6.19μs | 61.6%✅ |
| `partition/common/test_lang.py::test_detect_languages_korean_auto` | 267μs | 239μs | 11.7%✅ |
| `partition/common/test_lang.py::test_detect_languages_raises_TypeError_for_invalid_languages` | 1.62μs | 1.57μs | 3.64%✅ |
| `partition/common/test_lang.py::test_detect_languages_warns_for_auto_and_other_input` | 1.57ms | 1.44ms | 8.99%✅ |

</details>

<details>
<summary>🌀 Click to see Generated Regression Tests</summary>

```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests

from unstructured.partition.common.lang import detect_languages


# Dummy logger for test isolation (since the real logger is not available)
class DummyLogger:
    def debug(self, msg):
        pass

    def warning(self, msg):
        pass


logger = DummyLogger()

# Minimal TESSERACT_LANGUAGES_AND_CODES for test coverage
TESSERACT_LANGUAGES_AND_CODES = {
    "eng": "eng", "en": "eng",
    "fra": "fra", "fre": "fra", "fr": "fra",
    "spa": "spa", "es": "spa",
    "deu": "deu", "de": "deu",
    "zho": "zho", "zh": "zho", "chi": "zho",
    "kor": "kor", "ko": "kor",
    "rus": "rus", "ru": "rus",
    "ita": "ita", "it": "ita",
    "jpn": "jpn", "ja": "jpn",
}

# unit tests

# Basic Test Cases


def test_english_detection_auto():
    # Should detect English for a simple English sentence
    text = "This is a simple English sentence."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.08ms -> 940μs (15.0% faster)


def test_french_detection_auto():
    # Should detect French for a simple French sentence
    text = "Ceci est une phrase en français."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.00ms -> 912μs (9.74% faster)


def test_spanish_detection_auto():
    # Should detect Spanish for a simple Spanish sentence
    text = "Esta es una oración en español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 777μs -> 714μs (8.77% faster)


def test_german_detection_auto():
    # Should detect German for a simple German sentence
    text = "Dies ist ein deutscher Satz."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 626μs -> 616μs (1.61% faster)


def test_chinese_detection_auto():
    # Should detect Chinese for a simple Chinese sentence
    text = "这是一个中文句子。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 771μs -> 722μs (6.87% faster)


def test_korean_detection_auto():
    # Should detect Korean for a simple Korean sentence
    text = "이것은 한국어 문장입니다."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 272μs -> 260μs (4.76% faster)


def test_russian_detection_auto():
    # Should detect Russian for a simple Russian sentence
    text = "Это русское предложение."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 863μs -> 827μs (4.34% faster)


def test_japanese_detection_auto():
    # Should detect Japanese for a simple Japanese sentence
    text = "これは日本語の文です。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 255μs -> 237μs (7.88% faster)


def test_user_supplied_languages():
    # Should return the user-supplied language codes in ISO 639-2/B format
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng"])
    result = codeflash_output  # 5.01μs -> 4.08μs (22.8% faster)


def test_user_supplied_multiple_languages():
    # Should return all valid user-supplied language codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng", "fra", "spa"])
    result = codeflash_output  # 3.74μs -> 3.18μs (17.8% faster)


def test_user_supplied_language_aliases():
    # Should convert aliases to ISO 639-2/B codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["en", "fr", "es"])
    result = codeflash_output  # 3.51μs -> 2.89μs (21.6% faster)


def test_user_supplied_language_mixed_case():
    # Should handle mixed-case language codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["EnG", "FrA"])
    result = codeflash_output  # 3.43μs -> 2.86μs (19.8% faster)


def test_auto_overrides_user_supplied():
    # Should ignore user-supplied languages if "auto" is present
    text = "Ceci est une phrase en français."
    codeflash_output = detect_languages(text, ["auto", "eng"])
    result = codeflash_output  # 1.78ms -> 1.65ms (8.18% faster)


def test_none_languages_defaults_to_auto():
    # Should default to auto if languages=None
    text = "Dies ist ein deutscher Satz."
    codeflash_output = detect_languages(text, None)
    result = codeflash_output  # 619μs -> 583μs (6.12% faster)


def test_short_ascii_text_defaults_to_english():
    # Should default to English for short ASCII text
    text = "Hi!"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 5.71μs -> 4.45μs (28.5% faster)


def test_short_ascii_text_with_spaces_defaults_to_english():
    # Should default to English for short ASCII text with spaces
    text = "Hi there"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 4.05μs -> 3.31μs (22.4% faster)


# Edge Test Cases


def test_empty_text_returns_none():
    # Should return None for empty text
    codeflash_output = detect_languages("")  # 751ns -> 747ns (0.535% faster)


def test_whitespace_text_returns_none():
    # Should return None for whitespace-only text
    codeflash_output = detect_languages(" ")  # 754ns -> 726ns (3.86% faster)


def test_languages_first_element_empty_string_returns_none():
    # Should return None if languages[0] == ""
    text = "Some text"
    codeflash_output = detect_languages(text, [""])  # 540ns -> 544ns (0.735% slower)


def test_non_list_languages_raises_type_error():
    # Should raise TypeError if languages is not a list
    with pytest.raises(TypeError):
        detect_languages("Some text", "eng")  # 1.20μs -> 1.23μs (2.20% slower)


def test_invalid_language_code_ignored():
    # Should ignore invalid language codes in user-supplied list
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng", "invalid_code"])
    result = codeflash_output  # 4.13μs -> 3.45μs (19.8% faster)


def test_only_invalid_language_codes_returns_empty_list():
    # Should return empty list if all user-supplied codes are invalid
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["invalid1", "invalid2"])
    result = codeflash_output  # 3.93μs -> 2.91μs (35.0% faster)


def test_text_with_special_characters():
    # Should not default to English if text has special characters
    text = "niño año jalapeño"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 705μs -> 626μs (12.7% faster)


def test_text_with_multiple_languages():
    # Should detect multiple languages in text (order may vary)
    text = "This is English. Ceci est français. Esto es español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 2.65ms -> 2.41ms (10.3% faster)


def test_text_with_chinese_variants_normalizes_to_zho():
    # Should normalize all Chinese variants to "zho"
    text = "这是中文。這是中文。這是中國話。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 454μs -> 426μs (6.63% faster)


def test_text_with_unsupported_language_returns_none():
    # Should return None for gibberish text (langdetect fails)
    text = "asdfqwerzxcv"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 4.67μs -> 3.77μs (23.8% faster)


def test_text_with_numbers_and_symbols():
    # Should default to English for short ASCII text with numbers/symbols
    text = "1234!?"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 3.81μs -> 2.87μs (32.8% faster)


def test_text_with_long_ascii_non_english():
    # Should not default to English for long ASCII text that is not English
    text = "Ceci est une phrase en francais sans accents mais en francais"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.36ms -> 1.27ms (6.90% faster)


def test_text_with_newlines_and_tabs():
    # Should handle text with newlines and tabs
    text = "This is English.\nCeci est français.\tEsto es español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 2.48ms -> 2.29ms (8.09% faster)


# Large Scale Test Cases


def test_large_text_english():
    # Should detect English in a large English text
    text = " ".join(["This is a sentence."] * 500)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.37ms -> 8.16ms (2.51% faster)


def test_large_text_french():
    # Should detect French in a large French text
    text = " ".join(["Ceci est une phrase."] * 500)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.60ms -> 9.12ms (5.21% faster)


def test_large_text_mixed_languages():
    # Should detect multiple languages in a large mixed-language text
    text = ("This is English. " * 300) + ("Ceci est français. " * 300) + ("Esto es español. " * 300)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.71ms -> 9.30ms (4.33% faster)


def test_large_user_supplied_languages():
    # Should handle a large list of user-supplied languages (but only valid ones returned)
    text = "Does not matter."
    languages = ["eng"] * 500 + ["fra"] * 400 + ["invalid"] * 50
    codeflash_output = detect_languages(text, languages)
    result = codeflash_output  # 6.49μs -> 4.51μs (44.0% faster)


def test_large_text_with_special_characters():
    # Should detect Spanish in a large text with special characters
    text = "niño año jalapeño " * 500
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.76ms -> 8.27ms (5.91% faster)


def test_large_text_with_chinese_and_english():
    # Should detect both Chinese and English in a large mixed text
    text = ("This is English. " * 400) + ("这是中文。 " * 400)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.67ms -> 9.38ms (3.15% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from langdetect import lang_detect_exception

from unstructured.partition.common.lang import detect_languages

# unit tests

# Basic Test Cases


def test_detect_languages_english_auto():
    # Basic: English text, auto detection
    codeflash_output = detect_languages("This is a simple English sentence.")
    result = codeflash_output  # 1.18ms -> 1.11ms (6.25% faster)


def test_detect_languages_french_auto():
    # Basic: French text, auto detection
    codeflash_output = detect_languages("Ceci est une phrase française simple.")
    result = codeflash_output  # 1.25ms -> 1.15ms (8.97% faster)


def test_detect_languages_spanish_auto():
    # Basic: Spanish text, auto detection
    codeflash_output = detect_languages("Esta es una oración en español.")
    result = codeflash_output  # 794μs -> 742μs (7.04% faster)


def test_detect_languages_user_input_single():
    # Basic: User provides a single valid language code
    codeflash_output = detect_languages("Some text", ["eng"])
    result = codeflash_output  # 6.02μs -> 4.75μs (26.8% faster)


def test_detect_languages_user_input_multiple():
    # Basic: User provides multiple valid language codes
    codeflash_output = detect_languages("Some text", ["eng", "fra"])
    result = codeflash_output  # 3.68μs -> 2.83μs (29.8% faster)


def test_detect_languages_user_input_nonstandard_code():
    # Basic: User provides a nonstandard but mapped language code
    # e.g. "en" maps to "eng" via iso639
    codeflash_output = detect_languages("Some text", ["en"])
    result = codeflash_output  # 3.68μs -> 2.80μs (31.3% faster)


def test_detect_languages_auto_overrides_user_input():
    # Basic: "auto" in languages overrides user input
    codeflash_output = detect_languages("Ceci est une phrase française simple.", ["auto", "eng"])
    result = codeflash_output  # 2.05ms -> 1.90ms (7.49% faster)


def test_detect_languages_short_ascii_text_defaults_to_english():
    # Basic: Short ASCII text should default to English
    codeflash_output = detect_languages("Hi!")
    result = codeflash_output  # 5.07μs -> 4.20μs (20.7% faster)


def test_detect_languages_short_non_ascii_text():
    # Basic: Short non-ASCII text should not default to English
    codeflash_output = detect_languages("¡Hola!")
    result = codeflash_output  # 3.21ms -> 2.94ms (9.05% faster)


# Edge Test Cases


def test_detect_languages_empty_text_returns_none():
    # Edge: Empty string should return None
    codeflash_output = detect_languages("")
    result = codeflash_output  # 759ns -> 750ns (1.20% faster)


def test_detect_languages_whitespace_text_returns_none():
    # Edge: Whitespace only should return None
    codeflash_output = detect_languages(" \n\t ")
    result = codeflash_output  # 932ns -> 808ns (15.3% faster)


def test_detect_languages_languages_empty_string_returns_none():
    # Edge: languages[0] == "" should return None
    codeflash_output = detect_languages("Some text", [""])
    result = codeflash_output  # 538ns -> 517ns (4.06% faster)


def test_detect_languages_languages_none_defaults_to_auto():
    # Edge: languages=None should act like ["auto"]
    codeflash_output = detect_languages("Bonjour tout le monde", None)
    result = codeflash_output  # 4.49μs -> 3.66μs (22.7% faster)


def test_detect_languages_invalid_languages_type_raises():
    # Edge: languages is not a list, should raise TypeError
    with pytest.raises(TypeError):
        detect_languages("Some text", "eng")  # 1.32μs -> 1.24μs (6.64% faster)


def test_detect_languages_invalid_language_code_skipped():
    # Edge: User provides an invalid code, should skip it
    codeflash_output = detect_languages("Some text", ["eng", "notacode"])
    result = codeflash_output  # 3.87μs -> 3.01μs (28.7% faster)


def test_detect_languages_mixed_valid_invalid_codes():
    # Edge: User provides mixed valid/invalid codes
    codeflash_output = detect_languages("Some text", ["eng", "fra", "badcode"])
    result = codeflash_output  # 3.60μs -> 2.79μs (29.0% faster)


def test_detect_languages_detect_langs_exception_returns_none(monkeypatch):
    # Edge: langdetect raises exception, should return None
    def raise_exception(text):
        raise lang_detect_exception.LangDetectException("No features in text.")

    monkeypatch.setattr("langdetect.detect_langs", raise_exception)
    codeflash_output = detect_languages("This will error out.")
    result = codeflash_output  # 3.63μs -> 3.12μs (16.3% faster)


def test_detect_languages_chinese_variant_normalization():
    # Edge: Chinese variants normalized to "zho"
    # "你好,世界" is Chinese
    codeflash_output = detect_languages("你好,世界")
    result = codeflash_output  # 2.06ms -> 1.92ms (7.65% faster)


def test_detect_languages_multiple_languages_in_text():
    # Edge: Mixed language text
    text = "Hello world. Bonjour le monde. Hola mundo."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 3.92ms -> 3.70ms (5.93% faster)


def test_detect_languages_duplicate_chinese_not_repeated():
    # Edge: Multiple Chinese variants should not duplicate "zho"
    # Simulate langdetect returning zh-cn and zh-tw
    class DummyLangObj:
        def __init__(self, lang):
            self.lang = lang

    def fake_detect_langs(text):
        return [DummyLangObj("zh-cn"), DummyLangObj("zh-tw")]

    import langdetect

    monkeypatch = pytest.MonkeyPatch()
    monkeypatch.setattr(langdetect, "detect_langs", fake_detect_langs)
    codeflash_output = detect_languages("中文文本")
    result = codeflash_output  # 1.00ms -> 928μs (7.89% faster)
    monkeypatch.undo()


def test_detect_languages_non_ascii_short_text_not_default_eng():
    # Edge: Short non-ascii text should not default to English
    codeflash_output = detect_languages("你好")
    result = codeflash_output  # 1.37ms -> 1.26ms (8.34% faster)


def test_detect_languages_tesseract_code_mapping():
    # Edge: TESSERACT_LANGUAGES_AND_CODES mapping
    # For example, "chi_sim" should map to "zho"
    codeflash_output = detect_languages("Some text", ["chi_sim"])
    result = codeflash_output  # 4.56μs -> 3.45μs (32.0% faster)


# Large Scale Test Cases


def test_detect_languages_large_text_english():
    # Large: Large English text
    text = "This is a sentence. " * 500  # 500 sentences
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.32ms -> 8.13ms (2.36% faster)


def test_detect_languages_large_text_french():
    # Large: Large French text
    text = "Ceci est une phrase. " * 500
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.50ms -> 9.12ms (4.18% faster)


def test_detect_languages_large_text_mixed():
    # Large: Large mixed language text
    text = (
        "This is an English sentence. " * 333
        + "Ceci est une phrase française. " * 333
        + "Esta es una oración en español. " * 333
    )
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.10ms -> 8.79ms (3.48% faster)


def test_detect_languages_large_languages_list():
    # Large: User provides a large list of valid codes
    codes = ["eng", "fra", "spa", "deu", "ita", "por", "rus", "zho"] * 10  # 80 codes
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 6.75μs -> 4.37μs (54.7% faster)
    # Should contain all unique codes in iso639-3 form
    expected = ["eng", "fra", "spa", "deu", "ita", "por", "rus", "zho"]


def test_detect_languages_large_invalid_codes():
    # Large: User provides a large list of invalid codes
    codes = ["badcode" + str(i) for i in range(100)]
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 3.57μs -> 3.08μs (16.2% faster)


def test_detect_languages_performance_large_input():
    # Large: Performance with large input (under 1000 elements)
    text = "Hello world! " * 999
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 14.5ms -> 13.7ms (5.79% faster)


def test_detect_languages_performance_large_languages_list():
    # Large: Performance with large languages list (under 1000 elements)
    codes = ["eng"] * 999
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 6.01μs -> 3.87μs (55.5% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>

<details>
<summary>⏪ Click to see Replay Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-------------------------|:------------|:-------------|:--------|
| `test_benchmark5_py__replay_test_0.py::test_unstructured_partition_common_lang_detect_languages` | 4.94ms | 4.78ms | 3.27%✅ |

</details>

To edit these changes `git checkout codeflash/optimize-detect_languages-mjisezcy` and push.

[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=)](https://codeflash.ai) ![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
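
For readers who want to see the shape of the three optimizations named in the commit message (LRU-cached lookups, a module-level compiled regex, and set-based deduplication), here is a minimal sketch. It is not the code from `unstructured/partition/common/lang.py`: the lookup table `_FAKE_ISO639` and the functions `lookup_language`, `is_ascii`, and `dedupe_preserving_order` are hypothetical stand-ins, and a plain dict replaces the real `iso639.Language.match()` call so the example runs without third-party packages. Only the `_ASCII_RE` name, the regex itself, and `@lru_cache(maxsize=256)` come from the commit message.

```python
from __future__ import annotations

import re
from functools import lru_cache

# Compiled once at import time instead of on every call (optimization 2).
_ASCII_RE = re.compile(r"^[\x00-\x7F]+$")

# Hypothetical stand-in for the real ISO 639 lookup; not the library's data.
_FAKE_ISO639 = {
    "en": "eng", "eng": "eng",
    "fr": "fra", "fra": "fra",
    "es": "spa", "spa": "spa",
}


@lru_cache(maxsize=256)
def lookup_language(code: str) -> str | None:
    """Hypothetical cached lookup: repeated codes are served from the cache (optimization 1)."""
    return _FAKE_ISO639.get(code.lower())


def is_ascii(text: str) -> bool:
    """Return True when every character of a non-empty string is ASCII."""
    return _ASCII_RE.match(text) is not None


def dedupe_preserving_order(codes: list[str]) -> list[str]:
    """Resolve codes and drop duplicates, keeping first-seen order (optimization 3)."""
    seen: set[str] = set()      # O(1) membership checks
    ordered: list[str] = []     # preserves output order
    for code in codes:
        lang = lookup_language(code)
        if lang is not None and lang not in seen:
            seen.add(lang)
            ordered.append(lang)
    return ordered
```

Calling `lookup_language.cache_info()` after a batch of lookups shows how many were served from the cache, which is one way to check whether a given workload actually repeats language codes often enough to benefit.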