refactor(turbo-tasks): Make CURRENT_TASK_STATE shareable across multiple tokio tasks / threads, wait for detached work to finish before exiting a turbo task (#68921)
The current plan is that tasks using `local_cells` will still be spawned on separate threads using `tokio::spawn`. This means two things:
- `CURRENT_TASK_STATE` must use an `Arc<Mutex<...>>` so that multiple tasks across multiple threads can modify the same cell counters, local cells, etc.
- We can't `.take()` the data structures in `CURRENT_TASK_STATE` and perform cleanup of the parent task until all the child tasks using `local_cells` are complete.
Waiting for cleanup of child tasks is performed using [`tokio_util::task::TaskTracker`](https://docs.rs/tokio-util/latest/tokio_util/task/task_tracker/struct.TaskTracker.html), which seems like a reasonably clean and lightweight way of doing this (it uses some refcounts and a `tokio::sync::Notify` instance).
Because `TurboTasks::detached_for_testing` suffers from the same task-cleanup problems that `local_cells` will face, I'm using that API to test some of these changes.
## Testing
Added test coverage for `TurboTasks::detached_for_testing` that asserts that the parent task waits for the detached task to complete before exiting. Confirmed that the test fails if I comment out
```
ltt.wait().await
```
in the `TurboTasks::schedule` method.
Ran the test with:
```
cargo nextest r -p turbo-tasks-memory -- test_spawns_detached
```
## Benchmarking
*The switch from a `RefCell` to a `RwLock` will inevitably make this code somewhat slower, though an uncontended `RwLock` should still be fast enough. Lock contention only occurs when detached futures or `local_cells` are actually in use.*
On a [low noise machine](https://github.com/bgw/benchmark-scripts):
```
cargo bench -p turbopack-bench -p turbopack-cli -- "hmr_to_eval/Turbopack CSR"
```
Manually modified the benchmark to use more samples and a longer measurement time:
```diff
diff --git a/turbopack/crates/turbopack-bench/src/lib.rs b/turbopack/crates/turbopack-bench/src/lib.rs
index 4e3df12db0..e2f16fb0d6 100644
--- a/turbopack/crates/turbopack-bench/src/lib.rs
+++ b/turbopack/crates/turbopack-bench/src/lib.rs
@@ -127,8 +127,8 @@ enum CodeLocation {
pub fn bench_hmr_to_eval(c: &mut Criterion, bundlers: &[Box<dyn Bundler>]) {
let mut g = c.benchmark_group("bench_hmr_to_eval");
- g.sample_size(10);
- g.measurement_time(Duration::from_secs(60));
+ g.sample_size(50);
+ g.measurement_time(Duration::from_secs(300));
bench_hmr_internal(g, CodeLocation::Evaluation, bundlers);
}
```
**Before:**
```
bench_hmr_to_eval/Turbopack CSR/1000 modules
time: [14.714 ms 14.776 ms 14.854 ms]
Found 2 outliers among 50 measurements (4.00%)
1 (2.00%) high mild
1 (2.00%) high severe
```
**After:**
```
bench_hmr_to_eval/Turbopack CSR/1000 modules
time: [14.773 ms 14.846 ms 14.937 ms]
change: [-0.9937% +0.4037% +1.8024%] (p = 0.58 > 0.05)
No change in performance detected.
Found 1 outliers among 50 measurements (2.00%)
1 (2.00%) high severe
```
## Memory Benchmarking
```
cd ~/next.js
cargo build -p next-build-test --release
cargo run -p next-build-test --release -- generate ~/shadcn-ui/apps/www/ > ~/shadcn-ui/apps/www/project_options.json
cd ~/shadcn-ui/apps/www
heaptrack ~/next.js/target/release/next-build-test run sequential 1 1 '/sink'
heaptrack --analyze /home/bgw.linux/shadcn-ui/apps/www/heaptrack.next-build-test.1138984.zst | tail -50
```
Before *(Run 1)*:
```
total runtime: 791.00s.
calls to allocation functions: 245495035 (310358/s)
temporary memory allocations: 19923850 (25188/s)
peak heap memory consumption: 7.61G
peak RSS (including heaptrack overhead): 9.97G
total memory leaked: 12.53M
suppressed leaks: 7.24K
```
Before *(Run 2)*:
```
total runtime: 635.10s.
calls to allocation functions: 245867464 (387133/s)
temporary memory allocations: 19575558 (30822/s)
peak heap memory consumption: 7.62G
peak RSS (including heaptrack overhead): 9.96G
total memory leaked: 13.16M
suppressed leaks: 7.24K
```
After *(Run 1)*:
```
total runtime: 792.50s.
calls to allocation functions: 253642374 (320053/s)
temporary memory allocations: 19250501 (24290/s)
peak heap memory consumption: 7.62G
peak RSS (including heaptrack overhead): 9.96G
total memory leaked: 12.98M
suppressed leaks: 7.24K
```
After *(Run 2)*:
```
total runtime: 691.16s.
calls to allocation functions: 252890162 (365891/s)
temporary memory allocations: 19143529 (27697/s)
peak heap memory consumption: 7.61G
peak RSS (including heaptrack overhead): 9.98G
total memory leaked: 13.42M
suppressed leaks: 7.24K
```