Selectively enable opt-level 1 (#8141)
This PR compiles all non-workspace dependencies, as well as
`turbo-tasks-memory` (which is particularly sensitive) with basic
optimizations. Most crates in the workspace still use opt-level 0
locally.
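As a sketch, the per-package profile overrides can be expressed in the workspace `Cargo.toml` roughly like this (the `turbo-tasks-memory` entry matches this PR; the exact set of overridden crates is in the diff):

```toml
# Workspace Cargo.toml (sketch)

# Keep workspace crates at opt-level 0 for fast local iteration.
[profile.dev]
opt-level = 0

# Compile all non-workspace dependencies with basic optimizations.
# The "*" override applies to dependencies, not workspace members.
[profile.dev.package."*"]
opt-level = 1

# Opt in specific workspace crates that are particularly
# performance-sensitive in debug builds.
[profile.dev.package.turbo-tasks-memory]
opt-level = 1
```

Because dependencies change rarely, the extra cost of optimizing them is mostly paid once on cold builds; warm builds only recompile workspace crates, which stay at opt-level 0.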
While not as good as applying opt-level 1 everywhere, this significantly
reduces execution times compared to opt-level 0, at the cost of making
cold builds about 50-60% slower. Warm build times are largely unaffected.
The debugging (gdb/lldb) experience may also be slightly worsened by the
optimizations.
**What about `cargo check`/`cargo clippy`/`rust-analyzer`?** No expected
change, as (outside of proc macros) these don't perform LLVM code
generation.
**Why selectively, and not everywhere?** While applying this everywhere
can give us about 3x faster execution, the selective approach still gives us *most* of the
runtime performance benefits, while avoiding *most* of the compilation
cost (especially for warm builds). I believe we should still optimize
more for build times than execution times. I benchmarked applying
opt-level 1 to all crates here:
https://docs.google.com/document/d/1iaREbzYpDmBt54fT2egzptTfx0OYsTIJ633gRqddzDY/edit?usp=sharing
**Why not just a few hot dependencies?** I tried profiling the debug
build and only optimizing the hot crates, but I wasn't able to get
meaningful improvements in my testing.
# Benchmarking Notes
- System configuration is here: https://github.com/bgw/benchmark-scripts.
This is a downclocked machine with most CPU cores disabled to get
low-noise measurements. **Treat these results as relative to each other,
not as absolute values.**
- Build benchmarks are run with `mold`, as GNU `ld` is incredibly slow
(and often causes OOMs with 16GB of RAM). We're already using mold in
the private nextpack meta-repository. I'll follow up with another PR to
use mold or lld by default.
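For reference, making mold the default (rather than passing `RUSTFLAGS` on every invocation, as the benchmark commands below do) could look something like this in `.cargo/config.toml`; this is a sketch of what the follow-up PR might do, not its actual contents:

```toml
# .cargo/config.toml (sketch): use mold as the linker on Linux,
# equivalent to RUSTFLAGS=-Clink-arg=-fuse-ld=mold.
[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "link-arg=-fuse-ld=mold"]
```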
# Build Time Benchmarks
There's a significant regression to cold builds, but there's no
meaningful regression for warm builds.
## Cold time to build tests (2 runs):
```
rm -rf target/ && time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo nextest run -- dummy_filter_build_only_dont_run_any_tests
```
Before:
```
real 9m29.839s
real 9m27.522s
```
After:
```
real 15m28.105s
real 15m28.577s
```
## Warm time to build tests (2 runs):
Modify a string in an error message inside of
`crates/turbopack-ecmascript/src/minify.rs`. This forces
recompilation of all dependent crates without meaningfully changing any
behavior. Then run:
```
time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo nextest run -- dummy_filter_build_only_dont_run_any_tests
```
Before:
```
real 1m33.497s
real 1m36.134s
```
After:
```
real 1m41.232s
real 1m40.153s
```
## Warm time to build single binary (2 runs):
This is less dependent on linking than the tests, which generate many
binary targets.
Modify a string in an error message inside of
`crates/turbopack-ecmascript/src/minify.rs`. This forces
recompilation of all dependent crates without meaningfully changing any
behavior. Then run:
```
time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo build -p turbopack-cli
```
Before:
```
real 0m37.565s
real 0m37.058s
```
After:
```
real 0m36.450s
real 0m36.849s
```
## Cold time to build a single turborepo binary:
```
rm -rf target/ && time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo build -p turbo
```
Before:
```
real 3m43.488s
```
After:
```
real 4m54.416s
```
# Execution Time Benchmarks
## turbopack-cli's `bench_startup`
```
cargo bench --profile dev -p turbopack-cli
```
Before:
```
bench_startup/Turbopack CSR/1000 modules
time: [20.744 s 20.869 s 20.995 s]
```
After:
```
bench_startup/Turbopack CSR/1000 modules
time: [7.8037 s 7.8505 s 7.9030 s]
```
## Test Execution (excluding build, 2 runs)
With a completely warm build cache (such that nothing needs to build),
run:
```
time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo nextest run -E 'not test(node_file_trace)'
```
Before:
```
real 2m51.767s
real 2m51.482s
```
After:
```
real 1m17.286s
real 1m12.520s
```