unstructured
8929336e - perf: speed up standardize_quotes with str.translate() (#4314)

Commit
7 days ago
## Summary

- Replace per-character regex with a precomputed `str.maketrans()` + `str.translate()` table for `standardize_quotes`
- Covers all 36 Unicode fancy-quote codepoints (double + single) from the original regex
- Adds a benchmark (`test_unstructured/benchmarks/`) to track `standardize_quotes` performance

### Benchmark (Azure Standard_D8s_v5: 8 vCPU Intel Xeon Platinum 8473C, 32 GiB RAM)

## Benchmark: `origin/main` vs `codeflash/op`

### test_benchmark_standardize_quotes

| | Min | Median | Mean | OPS | Rounds |
|:---|---:|---:|---:|---:|---:|
| `origin/main` (base) | 161.25μs | 199.57μs | 200.72μs | 4.98 Kops/s | 5,461 |
| `codeflash/op` (head) | 99.17μs | 126.40μs | 127.86μs | 7.82 Kops/s | 10,581 |
| **Speedup** | **🟢 1.63x** | **🟢 1.58x** | **🟢 1.57x** | **🟢 1.57x** | |

| Function | base (μs) | head (μs) | Improvement | Speedup |
|:---|---:|---:|:---|---:|
| `standardize_quotes` | 128.60μs | 53.86μs | `██████░░░░` +58% | 🟢 2.39x |

---

*Generated by codeflash optimization agent*

<details>
<summary><b>Reproduce the benchmark locally</b></summary>

This PR includes a pytest-benchmark test at `test_unstructured/benchmarks/test_benchmark_standardize_quotes.py`. To run it:

```bash
pip install pytest-benchmark
pytest test_unstructured/benchmarks/test_benchmark_standardize_quotes.py --benchmark-only
```

To compare against `main`:

```bash
# Run on main and save baseline
git stash && git checkout main
pytest test_unstructured/benchmarks/test_benchmark_standardize_quotes.py --benchmark-only --benchmark-save=baseline

# Run on this branch and compare
git checkout - && git stash pop
pytest test_unstructured/benchmarks/test_benchmark_standardize_quotes.py --benchmark-only --benchmark-compare=0001_baseline
```

</details>

## Changelog

Added entry in `CHANGELOG.md` under 0.22.13.

## Test plan

- [x] Benchmarked on Azure VM (Standard_D8s_v5)
- [x] Existing unit tests pass; `standardize_quotes` is a drop-in replacement
- [x] All 36 quote codepoints covered by the translation table

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>