turbo-tasks: task-storage memory wins (#93720)
## Summary
Four small, independent changes that shrink `TaskStorage` and the data
it owns. Reviewing commit-by-commit is recommended.
1. **`Arc<CachedTaskType>` → `triomphe::Arc<CachedTaskType>`.**
`triomphe::Arc` is already a workspace dep used in `ReadRef` /
`SharedReference`. `CachedTaskType` never appears in a `Weak<...>`, so
we can drop the weak count and the CAS in `drop_slow`. Saves one `usize`
per allocation. Migrated via a `CachedTaskTypeArc` newtype so the
bincode `Encode`/`Decode` impls sit on a local type and satisfy the
orphan rule.
2. **Niche-encode `CellDependency`.** The `cell_dependencies` /
`cell_dependents` sets used to hold `(CellRef, Option<u64>)` tuples —
`Option<u64>` cost a full 16 B (8 B discriminant + 8 B value, aligned),
making each element 32 B. A `CellDependency` enum with two variants
(`All(CellRef)` / `Hash(CellRef, u64)`) lets the layout algorithm reuse
the niche on `ValueTypeId` (`NonZero<u16>`) inside
`CellRef.cell.type_id` for the variant tag. Element size drops 32 → 24
B; `LazyField` from 56 → 48 B. The same enum backs both forward and
reverse edges — for `cell_dependents` we re-point `CellRef.task` at the
dependent task.
Added `CellDependency::into_parts()` and use it in
`iter_cell_dependents` / `iter_cell_dependencies` hot loops so the
discriminant is checked once instead of twice via back-to-back
`cell_ref()` + `key()` calls.
3. **`TaskStorage::lazy: Vec<LazyField>` → `TinyVec<LazyField>`.** The
lazy vec holds at most ~25 elements (one per declared lazy field in
the schema), so `u8` length and capacity fields are enough. Swapping
`Vec`'s 24 B `(ptr, len, cap)` header for `(ptr, len: u8, cap: u8)` +
6 B padding gives 16 B. Drops `size_of::<TaskStorage>()` from
136 → 128 B.
`TinyVec` is hand-rolled so I added a push/iter micro-benchmark to
confirm it doesn't lose performance vs std `Vec`. Results below.
4. **Rightsize collections.** Audited the `AutoSet`/`AutoMap` types in
`storage_schema` so that each one is maximally sized for its natural
alignment.
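The size arithmetic in item 2 can be checked with a stand-alone sketch. The types below are hypothetical stand-ins, not the real turbo-tasks definitions: `task` is assumed to be a `NonZeroU32`, `ValueTypeId` a `NonZeroU16` (the `NonZero<u16>` mentioned above), and `index` a `u32`. Rust gives no layout guarantee for default-repr types, so the asserted sizes are what current rustc is expected to produce on 64-bit targets.

```rust
use std::mem::size_of;
use std::num::{NonZeroU16, NonZeroU32};

// Hypothetical stand-ins; field names and integer widths are assumptions
// chosen to match the sizes quoted in the summary.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ValueTypeId(NonZeroU16);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct CellId {
    type_id: ValueTypeId,
    index: u32,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct CellRef {
    task: NonZeroU32,
    cell: CellId,
}

// Two variants instead of `(CellRef, Option<u64>)`. The compiler can keep the
// variant tag in a niche or in alignment padding, so the separate 8 B
// discriminant word of `Option<u64>` disappears.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum CellDependency {
    All(CellRef),
    Hash(CellRef, u64),
}

impl CellDependency {
    // One match yields both parts, so callers inspect the discriminant once.
    fn into_parts(self) -> (CellRef, Option<u64>) {
        match self {
            CellDependency::All(cell) => (cell, None),
            CellDependency::Hash(cell, hash) => (cell, Some(hash)),
        }
    }
}

fn main() {
    // Old element shape: 12 B CellRef + 4 B padding + 16 B Option<u64>.
    assert_eq!(size_of::<(CellRef, Option<u64>)>(), 32);
    // New element shape: 12 B CellRef + 8 B hash, rounded up to align 8.
    assert_eq!(size_of::<CellDependency>(), 24);

    let cell = CellRef {
        task: NonZeroU32::new(7).unwrap(),
        cell: CellId {
            type_id: ValueTypeId(NonZeroU16::new(1).unwrap()),
            index: 3,
        },
    };
    assert_eq!(CellDependency::Hash(cell, 42).into_parts(), (cell, Some(42)));
    assert_eq!(CellDependency::All(cell).into_parts(), (cell, None));
}
```

Returning the tuple from `into_parts()` keeps call sites that still want `(CellRef, Option<u64>)` unchanged while the stored representation stays at 24 B.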
## Benchmark results
### `next build` on a representative app (15 runs each, M4 Pro, `caffeinate -dimsu nice -n -20`)
Fresh same-day baseline against branch:
| metric | canary | branch | Δ | 95% CI | significant? |
|---|---:|---:|---:|---|:---:|
| wall time | 40.83s | 41.12s | +0.7% | [−1.07s, +1.64s] | no |
| user time | 282.27s | 283.21s | +0.3% | [−1.02s, +2.89s] | no |
| sys time | 69.38s | 71.26s | +2.7% | [−1.54s, +5.32s] | no |
| **MaxRSS** | **12.47 GB** | **12.04 GB** | **−3.4%** | **[−0.48 GB, −0.38 GB]** | **yes** |
**MaxRSS is the headline.** −0.43 GB on a 12.5 GB working set, with
t=−17.86 (every branch run lower than every canary run, CV ≤ 0.6% on
both sides). Wall / user / sys are all within noise — this PR is a
memory win with no measurable timing impact.
### `TinyVec` vs `Vec` micro-bench (`turbo-tasks/benches/tiny_vec.rs`, 200 samples each)
| n | Vec push | TinyVec push | Δ% | Vec iter | TinyVec iter | Δ% |
|---:|---:|---:|---:|---:|---:|---:|
| 0 | 1.31ns | 894ps | **−31.8%** | 598ps | 596ps | −0.4% |
| 1 | 16.92ns | 14.75ns | **−12.9%** | 964ps | 952ps | −1.2% |
| 4 | 17.93ns | 15.93ns | **−11.1%** | 1.49ns | 1.50ns | +0.5% |
| 8 | 63.13ns | 45.24ns | **−28.3%** | 1.97ns | 1.96ns | −0.2% |
| 16 | 97.35ns | 79.91ns | **−17.9%** | 3.16ns | 3.14ns | −0.5% |
| 24 | 137.41ns | 119.88ns | **−12.8%** | 4.30ns | 4.30ns | +0.0% |
TinyVec push is 11–32% faster than Vec push at every measured size
(n = 0 to 24); iter differs by at most 1.2% in either direction, which
is noise. Run with `cargo bench -p turbo-tasks --bench tiny_vec`.
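For a concrete picture of the kind of type being benchmarked, here is a minimal sketch of a vector with a 16 B `(ptr, len: u8, cap: u8)` header. It is not the PR's `TinyVec`: the growth policy (start at 4, double), the panic when capacity would exceed `u8::MAX`, and the exclusion of zero-sized types are all assumptions made for illustration.

```rust
use std::alloc::{self, Layout};
use std::mem::size_of;
use std::ptr::{self, NonNull};

// Hypothetical sketch of a compact vector header: 8 B pointer + two u8
// counters + 6 B padding = 16 B on 64-bit targets.
struct TinyVec<T> {
    ptr: NonNull<T>,
    len: u8,
    cap: u8,
}

impl<T> TinyVec<T> {
    fn new() -> Self {
        assert!(size_of::<T>() != 0, "zero-sized types are out of scope here");
        TinyVec { ptr: NonNull::dangling(), len: 0, cap: 0 }
    }

    fn push(&mut self, value: T) {
        if self.len == self.cap {
            self.grow();
        }
        // SAFETY: after `grow`, len < cap, so the slot is allocated and unused.
        unsafe { self.ptr.as_ptr().add(self.len as usize).write(value) };
        self.len += 1;
    }

    fn grow(&mut self) {
        // Assumed policy: 4, 8, 16, ... and panic once u8 would overflow.
        let new_cap = match self.cap {
            0 => 4u8,
            c => c.checked_mul(2).expect("TinyVec capacity exceeds u8::MAX"),
        };
        let new_layout = Layout::array::<T>(new_cap as usize).expect("layout overflow");
        let raw = if self.cap == 0 {
            // SAFETY: T is not zero-sized and new_cap > 0, so the size is non-zero.
            unsafe { alloc::alloc(new_layout) }
        } else {
            let old_layout = Layout::array::<T>(self.cap as usize).unwrap();
            // SAFETY: `ptr` was allocated by this allocator with `old_layout`.
            unsafe { alloc::realloc(self.ptr.as_ptr().cast(), old_layout, new_layout.size()) }
        };
        self.ptr = NonNull::new(raw.cast::<T>())
            .unwrap_or_else(|| alloc::handle_alloc_error(new_layout));
        self.cap = new_cap;
    }

    fn as_slice(&self) -> &[T] {
        // SAFETY: the first `len` slots are initialized; a dangling but aligned
        // pointer is valid for an empty slice.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.len as usize) }
    }
}

impl<T> Drop for TinyVec<T> {
    fn drop(&mut self) {
        if self.cap != 0 {
            // SAFETY: drop the initialized prefix, then free the buffer with the
            // same layout it was allocated with.
            unsafe {
                ptr::drop_in_place(ptr::slice_from_raw_parts_mut(
                    self.ptr.as_ptr(),
                    self.len as usize,
                ));
                let layout = Layout::array::<T>(self.cap as usize).unwrap();
                alloc::dealloc(self.ptr.as_ptr().cast(), layout);
            }
        }
    }
}

fn main() {
    assert_eq!(size_of::<Vec<u64>>(), 24);
    assert_eq!(size_of::<TinyVec<u64>>(), 16);

    let mut v = TinyVec::new();
    for i in 0..25u64 {
        v.push(i);
    }
    assert_eq!(v.as_slice().len(), 25);
}
```

Narrow counters only pay off when the element count has a small, known upper bound, as with the ~25 lazy fields here; a general-purpose container would need a fallback for larger sizes.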
### `task_overhead/turbo` Criterion bench (M4 Pro, `--sample-size 200`)
| variant | dur | canary | branch | Δ | significant? |
|---|---:|---:|---:|---:|:---:|
| turbo-uncached | 1µs | 9.77 µs | 9.68 µs | −1.0% | yes |
| turbo-uncached | 1000µs | 1.01 ms | 1.01 ms | −0.1% | yes |
| turbo-cached-same-keys | 1µs | 198.6 ns | 191.9 ns | −3.4% | yes |
| turbo-cached-same-keys | 100µs | 226.5 ns | 208.1 ns | −8.1% | yes |
| turbo-cached-different-keys | 1µs | 233.8 ns | 224.1 ns | −4.2% | yes |
| turbo-cached-different-keys | 100µs | 305.3 ns | 246.9 ns | −19.1% | yes |
| turbo-uncached-parallel | 10µs | 1.63 µs | 1.54 µs | −5.8% | yes |
| turbo-uncached-parallel | 100µs | 8.41 µs | 7.88 µs | −6.3% | yes |
<!-- NEXT_JS_LLM_PR -->