77b8b33b - [LLD][COFF] Prefetch inputs early-on to improve link times (#169224)

This PR reduces outliers in runtime performance by asking the OS to prefetch memory-mapped input files in advance, as early as possible. I have implemented the Linux aspect as well; however, I have only tested this on Windows 11 version 24H2, with an active security stack enabled. The machine is an AMD Threadripper PRO 3975WX 32c/64t with 128 GB of RAM and a Samsung 990 PRO SSD.

I used an Unreal Engine-based game to profile the link times. Here's a quick summary of the input data:

```
Summary
--------------------------------------------------------------------------------
         4,169  Input OBJ files (expanded from all cmd-line inputs)
26,325,429,114  Size of all consumed OBJ files (non-lazy), in bytes
             9  PDB type server dependencies
             0  Precomp OBJ dependencies
   350,516,212  Input debug type records
18,146,407,324  Size of all input debug type records, in bytes
    15,709,427  Merged TPI records
     4,747,187  Merged IPI records
        56,408  Output PDB strings
    23,410,278  Global symbol records
    45,482,231  Module symbol records
     1,584,608  Public symbol records
```

In normal conditions - meaning all the pages are already in RAM - this PR has no noticeable effect:

```
>hyperfine "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     29.689 s ±  0.550 s    [User: 259.873 s, System: 37.936 s]
  Range (min … max):   29.026 s … 30.880 s    10 runs

Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     29.594 s ±  0.342 s    [User: 261.434 s, System: 62.259 s]
  Range (min … max):   29.209 s … 30.171 s    10 runs

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.00 ± 0.02 times faster than before\lld-link.exe @Game.exe.rsp
```

However, in production conditions we're typically working with the Unreal Engine Editor, with external DCC tools like Maya and Houdini; we have several instances of Visual Studio open, VSCode with rust-analyzer, etc.
All this means that between code change iterations, most of the input OBJ files might have already been evicted from the Windows RAM cache. Consequently, in the following test, I've simulated the worst-case condition by evicting all data from RAM with [RAMMap64](https://learn.microsoft.com/en-us/sysinternals/downloads/rammap) (i.e. `RAMMap64.exe -E[wsmt0]`, with a 5-sec sleep at the end to ensure the System thread actually has time to evict the pages):

```
>hyperfine -p cleanup.bat "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     48.124 s ±  1.770 s    [User: 269.031 s, System: 41.769 s]
  Range (min … max):   46.023 s … 50.388 s    10 runs

Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     34.192 s ±  0.478 s    [User: 263.620 s, System: 40.991 s]
  Range (min … max):   33.550 s … 34.916 s    10 runs

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.41 ± 0.06 times faster than before\lld-link.exe @Game.exe.rsp
```

This is similar to the work done for Mach-O in https://github.com/llvm/llvm-project/pull/157917.