Add PGO+LTO Makefile (#45641)
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:
1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`
<details>
<summary>Output looks roughly as follows</summary>
```c++
$ make -C contrib/pgo-lto top
make: Entering directory '/dev/shm/julia/contrib/pgo-lto'
llvm-profdata show --topn=50 /dev/shm/julia/contrib/pgo-lto/profiles/merged.prof | c++filt
Instrumentation level: IR entry_first = 0
Total functions: 85943
Maximum function count: 7867557260
Maximum internal block count: 3468437590
Top 50 functions with the largest internal block counts:
llvm::BitVector::operator|=(llvm::BitVector const&), max count = 7867557260
LateLowerGCFrame::ComputeLiveness(State&), max count = 3468437590
llvm::hashing::detail::hash_combine_recursive_helper::hash_combine_recursive_helper(), max count = 1742259834
llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 511396575
llvm::LiveRange::overlaps(llvm::LiveRange const&, llvm::CoalescerPair const&, llvm::SlotIndexes const&) const, max count = 508061762
llvm::StringMapImpl::LookupBucketFor(llvm::StringRef), max count = 505682177
std::map<llvm::BasicBlock*, BBState, std::less<llvm::BasicBlock*>, std::allocator<std::pair<llvm::BasicBlock* const, BBState> > >::operator[](llvm::BasicBlock* const&), max count = 395628888
llvm::LiveRange::advanceTo(llvm::LiveRange::Segment const*, llvm::SlotIndex) const, max count = 384642728
llvm::LiveRange::isLiveAtIndexes(llvm::ArrayRef<llvm::SlotIndex>) const, max count = 380291040
llvm::PassRegistry::enumerateWith(llvm::PassRegistrationListener*), max count = 352313953
ijl_method_instance_add_backedge, max count = 349608221
llvm::SUnit::ComputeHeight(), max count = 336604330
llvm::LiveRange::advanceTo(llvm::LiveRange::Segment*, llvm::SlotIndex), max count = 331030109
llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 272966545
llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&), max count = 257449540
LateLowerGCFrame::ComputeLiveSets(State&), max count = 252096274
/dev/shm/julia/src/jltypes.c:has_free_typevars, max count = 230879464
ijl_get_pgcstack, max count = 216953592
LateLowerGCFrame::RefineLiveSet(llvm::BitVector&, State&, std::vector<int, std::allocator<int> > const&), max count = 188013152
/dev/shm/julia/src/flisp/flisp.c:apply_cl, max count = 174863813
/dev/shm/julia/src/flisp/builtins.c:fl_memq, max count = 168621603
```
</details>
This often yields substantial speedups in time to first X, because it
reduces the time spent in LLVM optimization passes by 25% or even 30%.
Example 1:
```julia
using LoopVectorization
function f!(a, b)
@turbo for i in eachindex(a)
a[i] *= b[i]
end
return a
end
f!(rand(1), rand(1))
```
```console
$ time ./julia -O3 lv.jl
```
Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)
Example 2:
```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```
Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)
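The percentages in Examples 1 and 2 are relative changes in wall-clock time. A minimal Python sketch of that arithmetic follows; `parse_time` and `relative_change` are hypothetical helper names introduced here for illustration, and the inputs are the timings quoted above.

```python
import re

def parse_time(text):
    """Convert a shell `time` value such as '1m47.688s' or '14.801s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))

def relative_change(before, after):
    """Relative change from `before` to `after` in percent (negative means faster)."""
    return 100.0 * (after - before) / before

# Timings quoted in Examples 1 and 2: (label, without PGO+LTO, with PGO+LTO)
measurements = [
    ("LoopVectorization script", "14.801s", "11.978s"),
    ('Pkg.test("Unitful")', "1m47.688s", "1m35.704s"),
]

for label, before, after in measurements:
    change = relative_change(parse_time(before), parse_time(after))
    print(f"{label}: {change:.0f}%")
```

With the timings above this should print -19% and -11%, matching the figures reported for the two examples.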
Example 3 (taken from issue #45395, where almost all of the time is spent in LLVM):
```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```
Without PGO+LTO:
```
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 101.0130 seconds (98.6253 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
53.6961 ( 54.7%) 0.1050 ( 3.8%) 53.8012 ( 53.3%) 53.8045 ( 54.6%) Unroll loops
25.5423 ( 26.0%) 0.0072 ( 0.3%) 25.5495 ( 25.3%) 25.5444 ( 25.9%) Global Value Numbering
7.1995 ( 7.3%) 0.0526 ( 1.9%) 7.2521 ( 7.2%) 7.2517 ( 7.4%) Induction Variable Simplification
5.0541 ( 5.1%) 0.0098 ( 0.3%) 5.0639 ( 5.0%) 5.0561 ( 5.1%) Combine redundant instructions #2
```
With PGO+LTO:
```
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 72.6507 seconds (70.1337 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
36.0894 ( 51.7%) 0.0825 ( 2.9%) 36.1719 ( 49.8%) 36.1738 ( 51.6%) Unroll loops
16.5713 ( 23.7%) 0.0129 ( 0.5%) 16.5843 ( 22.8%) 16.5794 ( 23.6%) Global Value Numbering
5.9047 ( 8.5%) 0.0395 ( 1.4%) 5.9442 ( 8.2%) 5.9438 ( 8.5%) Induction Variable Simplification
4.7566 ( 6.8%) 0.0078 ( 0.3%) 4.7645 ( 6.6%) 4.7575 ( 6.8%) Combine redundant instructions #2
```
Or -28% time spent in LLVM.
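For a per-pass view, the same relative change can be computed from the User+System column of the two reports. This is only a sketch: the `timings` table is copied by hand from the reports above, and `relative_change` is a hypothetical helper name, not part of any tool.

```python
# User+System seconds copied from the two timing reports above:
# pass name -> (without PGO+LTO, with PGO+LTO)
timings = {
    "Total": (101.0130, 72.6507),
    "Unroll loops": (53.8012, 36.1719),
    "Global Value Numbering": (25.5495, 16.5843),
    "Induction Variable Simplification": (7.2521, 5.9442),
    "Combine redundant instructions #2": (5.0639, 4.7645),
}

def relative_change(before, after):
    """Relative change from `before` to `after` in percent (negative means faster)."""
    return 100.0 * (after - before) / before

changes = {name: relative_change(b, a) for name, (b, a) in timings.items()}

for name, change in changes.items():
    print(f"{name:<36} {change:6.1f}%")
```

On these numbers the two most expensive passes (loop unrolling and GVN) each shrink by roughly a third, while the cheaper passes listed improve less.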
`perf` reports show that this comes mostly from fewer executed
instructions and fewer icache misses.
---
Finally, there is a significant reduction in binary size. For libLLVM.so:
```
79M usr/lib/libLLVM-13jl.so (before)
67M usr/lib/libLLVM-13jl.so (after)
```
It can be reduced by another 2MB with `--icf=safe` when using LLD as
the linker.
- [x] Two out-of-source builds would be better than a single in-source
build, so that it's easier to find good profile data
---------
Co-authored-by: Oscar Smith <oscardssmith@gmail.com>
Co-authored-by: Lilith Orion Hafner <lilithhafner@gmail.com>