perf: Parallelize `turbo run` pre-execution hot path (#11958)
## Summary
Parallelizes several sequential phases of `turbo run`'s pre-execution
pipeline and reduces per-call allocation overhead. Individual functions
show significant improvement in `--profile` traces, though end-to-end
wall-clock gains are within noise on hyperfine benchmarks because
uninstrumented overhead (stdout serialization, daemon negotiation)
dominates total runtime.
## Changes
### Parallel `to_summary` task construction (`tracker.rs`)
The loop that builds `TaskSummary` structs was sequential. On a large
repo with ~1700 tasks this was ~92ms. Moved to `rayon::par_iter`. Each
`task_summary()` call is read-only on the engine, hash tracker (`RwLock`
read), and package graph.
**Profile:** `to_summary` 92ms → 10ms
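A sketch of the pattern, using std `thread::scope` in place of rayon's work-stealing `par_iter` so it stands alone; `HashTracker` and `summarize` are minimal stand-ins for the real engine types, not the actual API:

```rust
use std::sync::RwLock;
use std::thread;

// Stand-in for the task hash tracker: workers only take read locks.
struct HashTracker(RwLock<Vec<u64>>);

fn summarize(id: usize, tracker: &HashTracker) -> (usize, u64) {
    // Read-only access: any number of threads may hold the read lock at once.
    let hashes = tracker.0.read().unwrap();
    (id, hashes[id])
}

fn to_summary_parallel(ids: &[usize], tracker: &HashTracker) -> Vec<(usize, u64)> {
    // Split the work across a few scoped threads; rayon's par_iter does
    // the same with work stealing instead of fixed chunks.
    let workers = 4.min(ids.len()).max(1);
    let chunk = (ids.len() + workers - 1) / workers;
    thread::scope(|s| {
        let handles: Vec<_> = ids
            .chunks(chunk.max(1))
            .map(|c| {
                s.spawn(move || c.iter().map(|&id| summarize(id, tracker)).collect::<Vec<_>>())
            })
            .collect();
        // Joining in spawn order keeps the summaries in task order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

Because every closure is read-only over shared references, no extra synchronization is needed and output order is preserved.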
### Parallel turbo.json preloading (`loader.rs`, `builder.rs`)
The engine builder loaded each package's `turbo.json` lazily, one file
at a time on first access. Added `preload_all()` that reads and
parses all package turbo.json files in parallel via rayon before the
engine builder needs them. The `FixedMap` cache uses `OnceLock` per key,
so concurrent loads are non-blocking.
**Profile:** `build_engine` 74ms → 43ms
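A minimal sketch of why per-key `OnceLock` makes concurrent preloading non-blocking; `FixedCache` is a hypothetical stand-in for the real `FixedMap` (the key set is fixed up front, so each slot gets its own `OnceLock` and no map-level lock is ever taken):

```rust
use std::collections::HashMap;
use std::sync::OnceLock;
use std::thread;

struct FixedCache {
    // One OnceLock per known key: loads of different keys never contend.
    slots: HashMap<String, OnceLock<String>>,
}

impl FixedCache {
    fn new(keys: &[&str]) -> Self {
        Self {
            slots: keys.iter().map(|k| (k.to_string(), OnceLock::new())).collect(),
        }
    }

    // First caller for a key runs `load`; later callers get the cached value.
    fn get_or_load(&self, key: &str, load: impl FnOnce() -> String) -> &str {
        self.slots[key].get_or_init(load).as_str()
    }

    // Warm every slot in parallel before the engine builder asks for them.
    fn preload_all(&self) {
        thread::scope(|s| {
            for key in self.slots.keys() {
                s.spawn(move || {
                    // Stand-in for reading + parsing a package's turbo.json.
                    self.get_or_load(key, || format!("parsed:{key}"));
                });
            }
        });
    }
}
```

After `preload_all()`, the builder's lazy lookups hit already-initialized slots and return immediately.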
### Parallel `connect_internal_dependencies` (`builder.rs`,
`dep_splitter.rs`)
`Dependencies::new` calls that resolve internal vs. external deps ran
sequentially. Each call only reads the workspaces map, so the loop was
moved to `rayon::par_iter`. Also hoisted
`package_manager.link_workspace_packages()` (which reads a config file
from disk for pnpm/Berry) above the parallel loop so it's computed once
instead of N times.
**Profile:** `connect_internal_dependencies` 52ms → 24ms
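A sketch of the hoisting half of this change (std `thread::scope` stands in for rayon, and `link_workspace_packages`/`classify` are hypothetical stand-ins that just model the cost structure):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static CONFIG_READS: AtomicUsize = AtomicUsize::new(0);

// Stand-in for package_manager.link_workspace_packages(): reads a
// config file from disk, so it is comparatively expensive.
fn link_workspace_packages() -> bool {
    CONFIG_READS.fetch_add(1, Ordering::SeqCst);
    true
}

// Stand-in for internal-vs-external classification of one dependency.
fn classify(dep: &str, linked: bool) -> (String, bool) {
    (dep.to_string(), linked && dep.starts_with("pkg-"))
}

fn connect(deps: &[&str]) -> Vec<(String, bool)> {
    // Hoisted: computed once and shared read-only by every worker,
    // instead of once per dependency inside the parallel loop.
    let linked = link_workspace_packages();
    thread::scope(|s| {
        let handles: Vec<_> = deps
            .iter()
            .map(|&d| s.spawn(move || classify(d, linked)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

The config file is now read once per run regardless of how many dependencies the parallel loop classifies.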
### Faster hex encoding in gix index (`repo_index.rs`)
Replaced `e.id.to_hex().to_string()` (goes through `HexDisplay` →
`Display::fmt` → heap `String`) with `hex::encode_to_slice` into a stack
buffer, skipping the intermediate allocation.
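A std-only sketch equivalent to what `hex::encode_to_slice` does, to show where the allocation disappears (`with_oid_hex` is a hypothetical helper, not the repo's API):

```rust
// Write two lowercase hex digits per byte into a caller-provided
// buffer, with no intermediate heap String.
fn encode_hex_to_slice(bytes: &[u8], out: &mut [u8]) {
    const DIGITS: &[u8; 16] = b"0123456789abcdef";
    for (i, &b) in bytes.iter().enumerate() {
        out[2 * i] = DIGITS[(b >> 4) as usize];
        out[2 * i + 1] = DIGITS[(b & 0x0f) as usize];
    }
}

// A SHA-1 object id fits in a 40-byte stack buffer; the hex text can
// then be borrowed as &str with no heap allocation at all.
fn with_oid_hex<R>(oid: &[u8; 20], f: impl FnOnce(&str) -> R) -> R {
    let mut buf = [0u8; 40];
    encode_hex_to_slice(oid, &mut buf);
    f(std::str::from_utf8(&buf).unwrap())
}
```

The old `to_hex().to_string()` path allocated a `String` per index entry; this path allocates nothing unless the caller chooses to keep the text.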
### Reduce `find_untracked_files` allocations (`repo_index.rs`)
Eliminated the `Vec<String>` + `Arc` that cloned every
`RepoStatusEntry.path` for binary search. `status_entries` is pre-sorted
before the call, and walker threads binary search directly on the
borrowed `&[RepoStatusEntry]` slice.
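The lookup side of this can be sketched as follows (the struct is a minimal stand-in; `is_tracked_dirty` is a hypothetical name for the walker-side check):

```rust
struct RepoStatusEntry {
    path: String,
    // ...status flags elided in this sketch
}

// `entries` is pre-sorted by path before the walk; walker threads share
// the borrowed slice and binary search it directly, with no per-entry
// String clones and no Arc-wrapped key vector.
fn is_tracked_dirty(entries: &[RepoStatusEntry], path: &str) -> bool {
    entries
        .binary_search_by(|e| e.path.as_str().cmp(path))
        .is_ok()
}
```

Since the slice is only read, every walker thread can hold the same `&[RepoStatusEntry]` borrow concurrently.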
### Reduce `TaskHashTracker` per-call overhead (`lib.rs`)
Changed `external_deps_hash_cache` from `HashMap<PackageName, String>`
to `HashMap<String, String>`. Lookups use `task_id.package()` directly
instead of allocating a `PackageName` via `to_workspace_name()` on every
`calculate_task_hash` call.
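The allocation-free lookup works because `String` implements `Borrow<str>`, so a plain `&str` can key the map directly; this is a minimal stand-in for the real tracker, with a hypothetical accessor name:

```rust
use std::collections::HashMap;

// Before: HashMap<PackageName, String> forced allocating a PackageName
// on every lookup. Keying by String lets task_id.package() (a &str)
// hash and compare against the keys with no allocation.
struct TaskHashTracker {
    external_deps_hash_cache: HashMap<String, String>,
}

impl TaskHashTracker {
    fn cached_external_deps_hash(&self, package: &str) -> Option<&str> {
        self.external_deps_hash_cache.get(package).map(|s| s.as_str())
    }
}
```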
## Measurement
Profile-based measurements (`turbo run build --dry=json --profile`,
5-run median on a large ~1000 package repo) show the instrumented
portion of the run dropping from ~761ms to ~663ms. However, hyperfine
benchmarks on `--dry` runs across three repos of varying size show no
statistically significant end-to-end improvement: speedups of
1.00-1.03× with error bars of ±0.09 to ±0.22.
## Testing
- All existing tests pass
- Added a regression test for `connect_internal_dependencies` verifying
graph edges and external dependency classification are correct after
parallelization
- Existing `git_index_regression_tests` (31 tests) validate the hex
encoding and `find_untracked_files` changes
- Deadlock analysis: all parallelization uses either pure read-only
shared references, non-blocking `OnceLock` CAS, or read-only `RwLock`
acquisition with no concurrent writers