# perf: Parallelize `turbo run` pre-execution hot path (#11958)

Commit `b79b680f`, 64 days ago
## Summary

Parallelizes several sequential phases of `turbo run`'s pre-execution pipeline and reduces per-call allocation overhead. Individual functions show significant improvement in `--profile` traces, though the end-to-end wall-clock improvement is within noise on hyperfine benchmarks, because uninstrumented overhead (stdout serialization, daemon negotiation) dominates total runtime.

## Changes

### Parallel `to_summary` task construction (`tracker.rs`)

The loop that builds `TaskSummary` structs was sequential; on a large repo with ~1700 tasks it took ~92ms. Moved it to `rayon::par_iter`. Each `task_summary()` call is read-only on the engine, the hash tracker (an `RwLock` read), and the package graph.

**Profile:** `to_summary` 92ms → 10ms

### Parallel turbo.json preloading (`loader.rs`, `builder.rs`)

The engine builder loaded each package's `turbo.json` lazily, and therefore sequentially on first access. Added `preload_all()`, which reads and parses all package turbo.json files in parallel via rayon before the engine builder needs them. The `FixedMap` cache uses a `OnceLock` per key, so concurrent loads are non-blocking.

**Profile:** `build_engine` 74ms → 43ms

### Parallel `connect_internal_dependencies` (`builder.rs`, `dep_splitter.rs`)

The `Dependencies::new` calls that resolve internal vs. external deps were sequential. Each call is read-only on the workspaces map, so they were moved to `rayon::par_iter`. Also hoisted `package_manager.link_workspace_packages()` (which reads a config file from disk for pnpm/Berry) above the parallel loop so it is computed once instead of N times.

**Profile:** `connect_internal_dependencies` 52ms → 24ms

### Faster hex encoding in gix index (`repo_index.rs`)

Replaced `e.id.to_hex().to_string()` (which goes through `HexDisplay` → `Display::fmt` → a heap-allocated `String`) with `hex::encode_to_slice` into a stack buffer, skipping the intermediate allocation.
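The stack-buffer idea can be sketched in std-only Rust. The actual change uses the `hex` crate's `encode_to_slice`, so the manual nibble loop below is an illustrative stand-in; the 20-byte SHA-1 object-id length is an assumption about the gix default.

```rust
// Sketch of the stack-buffer hex encoding. The real code calls
// `hex::encode_to_slice`; this std-only version shows why it avoids the
// intermediate heap `String` produced by `to_hex().to_string()`.
const HEX: &[u8; 16] = b"0123456789abcdef";

/// Hex-encode a 20-byte object id into a caller-provided stack buffer.
fn encode_oid_hex(id: &[u8; 20], out: &mut [u8; 40]) {
    for (i, byte) in id.iter().enumerate() {
        out[i * 2] = HEX[(byte >> 4) as usize];
        out[i * 2 + 1] = HEX[(byte & 0x0f) as usize];
    }
}

fn main() {
    let id = [0xabu8; 20];
    // The buffer lives on the stack; borrow it as a &str only when needed.
    let mut buf = [0u8; 40];
    encode_oid_hex(&id, &mut buf);
    println!("{}", std::str::from_utf8(&buf).unwrap());
}
```

Because the 40-byte buffer is fixed-size and stack-allocated, the hot loop performs no heap allocation per index entry.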
### Reduce `find_untracked_files` allocations (`repo_index.rs`)

Eliminated the `Vec<String>` + `Arc` that cloned every `RepoStatusEntry.path` for binary search. `status_entries` is pre-sorted before the call, and walker threads now binary search directly on the borrowed `&[RepoStatusEntry]` slice.

### Reduce `TaskHashTracker` per-call overhead (`lib.rs`)

Changed `external_deps_hash_cache` from `HashMap<PackageName, String>` to `HashMap<String, String>`. Lookups use `task_id.package()` directly instead of allocating a `PackageName` via `to_workspace_name()` on every `calculate_task_hash` call.

## Measurement

Profile-based measurements (`turbo run build --dry=json --profile`, 5-run median on a large repo of ~1000 packages) show the instrumented portion of the run dropping from ~761ms to ~663ms. However, hyperfine benchmarks on `--dry` runs across three repos of varying size show no statistically significant end-to-end improvement: results are 1.00-1.03× with error bars of ±0.09 to ±0.22.

## Testing

- All existing tests pass
- Added a regression test for `connect_internal_dependencies` verifying that graph edges and external-dependency classification are correct after parallelization
- The existing `git_index_regression_tests` (31 tests) validate the hex encoding and `find_untracked_files` changes
- Deadlock analysis: all parallelization uses either purely read-only shared references, non-blocking `OnceLock` CAS, or read-only `RwLock` acquisition with no concurrent writers
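The `find_untracked_files` lookup shape can be sketched as follows, assuming a simplified `RepoStatusEntry` with only a `path` field (the real struct carries more state, and the walker runs on multiple threads): callers binary search the pre-sorted borrowed slice instead of first cloning every path into a `Vec<String>`.

```rust
// Simplified, hypothetical stand-in for the real RepoStatusEntry.
struct RepoStatusEntry {
    path: String,
}

/// `entries` must be pre-sorted by `path`, as `status_entries` is before
/// the call. Borrowing the slice means no per-call clone of the paths;
/// a shared `&[RepoStatusEntry]` is also freely usable across threads.
fn is_status_entry(entries: &[RepoStatusEntry], candidate: &str) -> bool {
    entries
        .binary_search_by(|e| e.path.as_str().cmp(candidate))
        .is_ok()
}

fn main() {
    let entries = vec![
        RepoStatusEntry { path: "a.txt".into() },
        RepoStatusEntry { path: "src/lib.rs".into() },
    ];
    println!("{}", is_status_entry(&entries, "src/lib.rs"));
}
```

The design choice is the standard one for read-mostly parallel lookups: keep the data sorted once up front, then let every thread search the shared borrowed slice with `binary_search_by`, which compares against a borrowed `&str` key and so allocates nothing.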