[ELF] Parallelize input file loading (#191690)

During `createFiles()`, `addFile()` now records a `LoadJob` for each
non-script input (archive, relocatable, DSO, bitcode, binary) together
with a snapshot of the driver state machine (`inWholeArchive`, `inLib`,
`asNeeded`, `withLOption`, `groupId`). `loadFiles()` then expands the
recorded jobs on worker threads. Linker scripts are still processed
inline, since their `INPUT()` and `GROUP()` commands recursively call
`addFile()`.
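
The record-then-expand split can be illustrated with a minimal sketch. It is
not the lld code: `DriverState`, `recordJobs`, and `loadJobs` are hypothetical
names, only two of the snapshot flags are modelled, the job carries a
simplified `produced` list, and plain `std::thread` stands in for whatever
parallel primitive the patch actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical, simplified stand-ins; not lld's actual declarations.
struct DriverState {
  bool inWholeArchive = false;
  bool asNeeded = false;
};

struct LoadJob {
  std::string path;
  DriverState state;                 // flags in effect when the path was seen
  std::vector<std::string> produced; // written by exactly one worker
};

// Serial phase: walk the arguments in command-line order, update the state
// machine, and record one job (with a state snapshot) per input path.
inline std::vector<LoadJob> recordJobs(const std::vector<std::string> &args) {
  std::vector<LoadJob> jobs;
  DriverState state;
  for (const std::string &arg : args) {
    if (arg == "--whole-archive")
      state.inWholeArchive = true;
    else if (arg == "--no-whole-archive")
      state.inWholeArchive = false;
    else if (arg == "--as-needed")
      state.asNeeded = true;
    else if (arg == "--no-as-needed")
      state.asNeeded = false;
    else
      jobs.push_back(LoadJob{arg, state, {}});
  }
  return jobs;
}

// Parallel phase: worker t handles jobs t, t + numThreads, ... and touches
// only its own job's slot, so no locking is needed and the result order is
// fixed by the serial walk, not by thread scheduling.
inline void loadJobs(std::vector<LoadJob> &jobs, unsigned numThreads) {
  assert(numThreads > 0);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t)
    workers.emplace_back([&jobs, t, numThreads] {
      for (std::size_t i = t; i < jobs.size(); i += numThreads) {
        LoadJob &job = jobs[i];
        // Stand-in for "open the file and construct InputFiles": the
        // worker reads the snapshot, never the live driver state.
        std::string tag = job.path;
        if (job.state.inWholeArchive)
          tag += " [whole-archive]";
        if (job.state.asNeeded)
          tag += " [as-needed]";
        job.produced.push_back(tag);
      }
    });
  for (std::thread &w : workers)
    w.join();
}
```

Because each job carries its own copy of the flags, a worker that runs late
still sees the state that was current at the job's position on the command
line.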

Outside `createFiles()`, `deferLoad` is false: `loadFiles()` is called
with a single job, which it drains immediately. Two cases:
- `addDependentLibrary()`: `.deplibs` sections trigger `addFile()`
during the serial `doParseFiles()` loop.
- `--just-symbols`: pushes files directly, bypassing
`addFile`/`LoadJob`.
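
The `deferLoad` switch can be pictured roughly as follows. This is a
single-threaded sketch under assumptions: the `Loader` class and its members
are hypothetical, and only the queue-versus-drain decision is modelled.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of the deferLoad switch; not lld's actual driver.
class Loader {
public:
  bool deferLoad = false;           // true only while createFiles() runs
  std::vector<std::string> pending; // recorded jobs not yet expanded
  std::vector<std::string> loaded;  // expanded files, in order

  void addFile(std::string path) {
    pending.push_back(std::move(path));
    // Late additions (deferLoad == false) are expanded on the spot, so
    // the caller sees the resulting file immediately.
    if (!deferLoad)
      loadFiles();
  }

  // Expands every pending job in order and empties the queue.
  void loadFiles() {
    for (std::string &p : pending)
      loaded.push_back(std::move(p));
    pending.clear();
  }
};
```

With `deferLoad` set, `addFile()` only queues; with it cleared, the queue
never holds more than the one job that was just added.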

Thread-safety:
- A mutex serializes the `BitcodeFile` / fatLTO constructors that call
`ctx.saver` / `ctx.uniqueSaver`. The mutex is uncontended on pure ELF
links.
- Thin-archive member buffers accumulate in per-job `SmallVector`s and
are merged into `ctx.memoryBuffers` in command-line order.
- `groupId` is pre-claimed during the serial walk and written to each
produced file after construction (the `InputFile` constructor no
longer reads `nextGroupId`).
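
A compact sketch of the three points above, under assumptions: `File`, `Job`,
`SharedCtx`, `claimGroupIds`, and `runJobs` are hypothetical names, a
`std::set` stands in for the non-thread-safe savers, and one group id is
claimed per job for simplicity (the real grouping rules are not modelled).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins; not lld's actual types.
struct File {
  std::string name;
  std::uint32_t groupId = 0; // assigned after construction
};

struct Job {
  std::string path;
  bool isBitcode = false;
  std::uint32_t groupId = 0;        // claimed serially, before workers run
  std::vector<File> files;          // job-local, no lock needed
  std::vector<std::string> buffers; // job-local, merged later in order
};

struct SharedCtx {
  std::mutex saverMutex;
  std::set<std::string> saver; // stands in for a non-thread-safe saver
  std::vector<std::string> memoryBuffers;
};

// Serial walk: ids are handed out in command-line order.
inline void claimGroupIds(std::vector<Job> &jobs) {
  std::uint32_t nextGroupId = 0;
  for (Job &job : jobs)
    job.groupId = nextGroupId++;
}

inline void runJobs(SharedCtx &ctx, std::vector<Job> &jobs,
                    unsigned numThreads) {
  assert(numThreads > 0);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t)
    workers.emplace_back([&ctx, &jobs, t, numThreads] {
      for (std::size_t i = t; i < jobs.size(); i += numThreads) {
        Job &job = jobs[i];
        job.buffers.push_back(job.path + ":member");
        if (job.isBitcode) {
          // Only this path touches shared state, so only it locks.
          std::lock_guard<std::mutex> lock(ctx.saverMutex);
          ctx.saver.insert(job.path);
        }
        File f;
        f.name = job.path;
        f.groupId = job.groupId; // written after construction
        job.files.push_back(std::move(f));
      }
    });
  for (std::thread &w : workers)
    w.join();
  // Serial merge in job order: the shared vector's layout does not depend
  // on which worker finished first.
  for (Job &job : jobs) {
    for (std::string &buf : job.buffers)
      ctx.memoryBuffers.push_back(std::move(buf));
    job.buffers.clear();
  }
}
```

Keeping the merge serial is what makes the shared buffer order independent of
the thread count in this sketch.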

Performance (`--threads=8`):
```
clang-relassert (267 thin archives, 10 .o, 2 .so):
  965 +/- 32 ms -> 924 +/- 24 ms (1.05x, 80 runs)
  (Apple M4) 249.7ms ± 2.5ms -> 221.2ms ± 1.4ms (1.13x, 10 runs)
chromium (532 .a, 3314 .o, 343 .so):
  8.071 +/- 0.472 s -> 7.370 +/- 0.198 s (1.10x, 20 runs)
```

Parallelizing all file kinds (not just archives) matters for
`.o`-dominated workloads like chromium, where archive-only
parallelization shows no improvement.

Output is byte-identical to that of lld before this change and is
deterministic across `--threads` values.