turborepo
eba684f6 - perf: Replace `libgit2` git status with `gix-index` for faster file hashing (#11950)

Commit
2 days ago
perf: Replace `libgit2` git status with `gix-index` for faster file hashing (#11950)

## Summary

Replaces `RepoGitIndex`'s libgit2-based `git ls-tree` + `git status` with a new code path that reads the `.git/index` file directly via `gix-index`. This eliminates the most expensive git operation in `turbo run` by combining two separate libgit2 calls into a single index read + parallel stat comparison.

## Results

**Profile data** (`RepoGitIndex::new`):

| Repo | libgit2 (before) | gix-index (after) | Improvement |
|---|---|---|---|
| Large (~500 packages, ~1700 tasks) | 397.8ms | 296.9ms | **-25%** |

**Wall-clock benchmarks** (hyperfine, `--dry --skip-infer`, 10+ warmup, 10+ runs):

| Repo | Speedup |
|---|---|
| Large (~500 packages) | **1.08-1.11x** |
| Medium (~120 packages) | **1.20-1.35x** |
| Small (~3 packages) | 1.00x |

Measured with `--profile` on three private repos of different sizes. All profiles taken on the same machine, same base commit, clean working trees.

The medium repo shows the biggest wall-clock improvement because git operations are a larger fraction of total run time. The large repo has a smaller relative improvement because other operations (engine build, lockfile parsing, globwalk) dominate.

## Why

`git_status_repo_root` (via libgit2's `repo.statuses()`) was the single most expensive operation in `turbo run`, consuming 30-70% of total profiled duration depending on repo size. It stat-checks every tracked file AND walks the entire working tree for untracked files in a single-threaded C call.

## What Changed

**New gix-index code path** (`repo_index.rs`):

- Reads `.git/index` via `gix-index` (mmap'd, ~2-5ms) to get every tracked file's blob OID and cached stat data
- Stats each tracked file in parallel via rayon, comparing filesystem stat against index stat using `gix_index::entry::Stat::matches()`
- Racy-git entries (mtime >= index timestamp) are deferred to per-package `hash_objects` instead of content-hashing inline, which avoids reading every file from disk on fresh checkouts
- Uses nanosecond timestamp precision (`use_nsec: true`) to reduce false racy entries on modern filesystems (APFS, ext4)
- Detects untracked files via the `ignore` crate's parallel walker (respects `.gitignore`)
- Falls back to the existing libgit2 path if gix-index fails

**Dependency changes:**

- Added `gix-index` as an optional dependency behind a `gix` feature flag (~27 new crates, all pure Rust)

**Optimizations applied:**

- Removed redundant sort of `ls_tree_hashes` (git index is already sorted, rayon preserves order)
- Deferred OID hex conversion: raw `ObjectId` carried through the parallel loop, hex string allocated only for clean entries
- Binary search on sorted vecs instead of `HashSet` for untracked file detection

**Test coverage:**

- 31 regression tests covering equivalence, edge cases (gitignore, symlinks, prefix boundaries, racy-git), and contract guarantees (sorted invariants, OID compatibility, determinism)
- Shared test utilities module (`test_utils.rs`)
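
The classification and untracked-detection steps described above can be illustrated with a minimal, std-only sketch. It is not the code from `repo_index.rs`: `CachedStat`, `EntryState`, `classify`, and `untracked` are hypothetical names, the stat comparison is reduced to mtime and size (the real path delegates to `gix_index::entry::Stat::matches()`), and the parallel rayon loop is omitted.

```rust
/// Hypothetical, simplified stand-in for the stat data cached in an index entry.
/// The real comparison is delegated to `gix_index::entry::Stat::matches()`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CachedStat {
    mtime_secs: u64,
    mtime_nsecs: u32,
    size: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum EntryState {
    /// Stat matches and the entry is older than the index: reuse the index OID.
    Clean,
    /// Stat differs: the file must be re-hashed.
    Modified,
    /// Stat matches but mtime >= index timestamp: the stat data cannot be
    /// trusted, so hashing is deferred instead of being done inline.
    Racy,
}

/// Classify one tracked file. `index_mtime` is the (secs, nsecs) timestamp
/// of the index file itself.
fn classify(cached: CachedStat, on_disk: CachedStat, index_mtime: (u64, u32)) -> EntryState {
    if cached != on_disk {
        return EntryState::Modified;
    }
    // Tuples compare lexicographically, so nanoseconds only break ties
    // between equal seconds. This mirrors the nanosecond-precision idea.
    if (cached.mtime_secs, cached.mtime_nsecs) >= index_mtime {
        EntryState::Racy
    } else {
        EntryState::Clean
    }
}

/// Return the walked paths that are not tracked. `tracked_sorted` must be
/// sorted in byte order (Rust's `str` ordering), so a binary search can
/// replace a `HashSet` lookup.
fn untracked<'a>(tracked_sorted: &[&'a str], walked: &[&'a str]) -> Vec<&'a str> {
    walked
        .iter()
        .copied()
        .filter(|path| tracked_sorted.binary_search(path).is_err())
        .collect()
}
```

In this sketch only `Clean` entries reuse the blob OID from the index. `Racy` entries are handed to a later hashing step, which is the role the commit assigns to per-package `hash_objects`.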