Enable per thread register state cache on libunwind (#55049)
While looking into a profile recently, I noticed that when recording
backtraces, CPU time is dominated by lookups and updates to
libunwind's register state cache (`get_rs_cache`, `put_rs_cache`).

Those functions also take a lock and call `sigprocmask`, which does not
scale, so the problem is even more pronounced when recording backtraces
in parallel.

This translates to the following times on a recent laptop (Linux x86_64):
```
julia> @time for i in 1:1000000 Base.backtrace() end
8.286924 seconds (32.00 M allocations: 8.389 GiB, 1.46% gc time)
julia> @time Threads.@sync for i in 1:16
           Threads.@spawn for j in 1:1000000
               Base.backtrace()
           end
       end
20.448630 seconds (160.01 M allocations: 123.740 GiB, 8.05% gc time, 0.43% compilation time: 18% of which was recompilation)
```
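To reproduce this kind of measurement outside the REPL, the two loops can be wrapped in one function. The following is only a sketch: `time_backtraces` and its `ntasks` keyword are hypothetical names, not part of this PR. Note that with `ntasks = 1` it still spawns a task, which differs slightly from the plain serial loop above.

```julia
# Hypothetical helper (not part of the PR): run `n` backtrace captures on
# each of `ntasks` spawned tasks and return the elapsed wall time in seconds.
function time_backtraces(n::Integer; ntasks::Integer = 1)
    return @elapsed begin
        @sync for _ in 1:ntasks
            Threads.@spawn for _ in 1:n
                Base.backtrace()
            end
        end
    end
end

# Small iteration counts keep the sketch quick; the numbers above used 1_000_000.
serial = time_backtraces(10_000)
parallel = time_backtraces(10_000; ntasks = Threads.nthreads())
println("serial: ", serial, " s, parallel: ", parallel, " s")
```

Start Julia with several threads (for example `julia -t 16`) so that the parallel case actually exercises the shared cache.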
The good news is that libunwind already has a solution for this in the
form of the `--enable-per-thread-cache` build option, which uses a
thread-local cache for register state instead of the default global one
([1](https://libunwind-devel.nongnu.narkive.com/V3gtFUL9/question-about-performance-of-threaded-access-in-libunwind)).
Enabling it is not without hiccups, though: because of how we `dlopen`
libunwind, we also need a small patch
([2](https://libunwind-devel.nongnu.narkive.com/QG1K3Uke/tls-model-initial-exec-attribute-prevents-dynamic-loading-of-libunwind-via-dlopen)).
With those changes applied, we get:
```
julia> @time for i in 1:1000000 Base.backtrace() end
2.378070 seconds (32.00 M allocations: 8.389 GiB, 4.72% gc time)
julia> @time Threads.@sync for i in 1:16
           Threads.@spawn for j in 1:1000000
               Base.backtrace()
           end
       end
3.657772 seconds (160.01 M allocations: 123.740 GiB, 52.05% gc time, 2.33% compilation time: 19% of which was recompilation)
```
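As a quick sanity check on the improvement, the speedups can be computed directly from the four wall times reported above. The variable names are only illustrative:

```julia
# Wall times in seconds, copied from the @time outputs above.
before = (serial = 8.286924, parallel = 20.448630)
after  = (serial = 2.378070, parallel = 3.657772)

speedup_serial   = before.serial / after.serial
speedup_parallel = before.parallel / after.parallel

println(round(speedup_serial; digits = 2))    # 3.48
println(round(speedup_parallel; digits = 2))  # 5.59
```

That is roughly a 3.5x improvement for the single-threaded loop and 5.6x for the 16-task loop on this machine.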
Profiles after the change: single-threaded and multi-threaded (images omitted).

As a companion to this PR, I have opened another one that applies the
same change to LibUnwind_jll [on
Yggdrasil](https://github.com/JuliaPackaging/Yggdrasil/pull/9030). Once
that lands, we can bump the version here.