[LLD][COFF] Prefetch inputs early-on to improve link times (#169224)
This PR reduces link-time outliers by asking the OS to prefetch
memory-mapped input files as early as possible. I have implemented the
Linux side as well, but I have only tested this on Windows 11 version
24H2, with an active security stack enabled. The machine is an AMD
Threadripper PRO 3975WX 32c/64t with 128 GB of RAM and a Samsung 990 PRO
SSD.
I used an Unreal Engine-based game to profile the link times. Here's
a quick summary of the input data:
```
Summary
--------------------------------------------------------------------------------
4,169 Input OBJ files (expanded from all cmd-line inputs)
26,325,429,114 Size of all consumed OBJ files (non-lazy), in bytes
9 PDB type server dependencies
0 Precomp OBJ dependencies
350,516,212 Input debug type records
18,146,407,324 Size of all input debug type records, in bytes
15,709,427 Merged TPI records
4,747,187 Merged IPI records
56,408 Output PDB strings
23,410,278 Global symbol records
45,482,231 Module symbol records
1,584,608 Public symbol records
```
In normal conditions, meaning all the pages are already in RAM, this
PR has no noticeable effect on wall-clock time:
```
>hyperfine "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
Time (mean ± σ): 29.689 s ± 0.550 s [User: 259.873 s, System: 37.936 s]
Range (min … max): 29.026 s … 30.880 s 10 runs
Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
Time (mean ± σ): 29.594 s ± 0.342 s [User: 261.434 s, System: 62.259 s]
Range (min … max): 29.209 s … 30.171 s 10 runs
Summary
with_pr\lld-link.exe @Game.exe.rsp ran
1.00 ± 0.02 times faster than before\lld-link.exe @Game.exe.rsp
```
However, in production conditions we're typically working with the
Unreal Engine Editor alongside external DCC tools like Maya and Houdini;
we have several instances of Visual Studio open, VSCode with Rust
analyzer, etc. All this means that between code change iterations, most
of the input OBJ files might already have been evicted from the Windows
RAM cache. Consequently, in the following test, I've simulated the
worst-case condition by evicting all data from RAM with
[RAMMap64](https://learn.microsoft.com/en-us/sysinternals/downloads/rammap)
(i.e. `RAMMap64.exe -E[wsmt0]` with a 5-sec sleep at the end to ensure
the System thread actually has time to evict the pages):
```
>hyperfine -p cleanup.bat "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
Time (mean ± σ): 48.124 s ± 1.770 s [User: 269.031 s, System: 41.769 s]
Range (min … max): 46.023 s … 50.388 s 10 runs
Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
Time (mean ± σ): 34.192 s ± 0.478 s [User: 263.620 s, System: 40.991 s]
Range (min … max): 33.550 s … 34.916 s 10 runs
Summary
with_pr\lld-link.exe @Game.exe.rsp ran
1.41 ± 0.06 times faster than before\lld-link.exe @Game.exe.rsp
```
This is similar to the work done in MachO in
https://github.com/llvm/llvm-project/pull/157917