Implement faster thread local rng for scheduler (#55501)
Implement an optimal uniform random number generator using the method
proposed in https://github.com/swiftlang/swift/pull/39143, based on
OpenSSL's implementation of it in
https://github.com/openssl/openssl/blob/1d2cbd9b5a126189d5e9bc78a3bdb9709427d02b/crypto/rand/rand_uniform.c#L13-L99.
This PR also fixes some bugs found while developing it. It is a
replacement for https://github.com/JuliaLang/julia/pull/50203 and fixes
the issues found by @IanButterworth with both rngs.
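
For readers unfamiliar with the technique, here is a rough C sketch of the general idea behind the linked method, not the code from this PR. Multiply a random word by the bound, take the high half as the result, and draw more random words only when a carry out of the low half could still change that result. Everything named `sketch_*` is hypothetical, the xorshift generator is only a stand-in, and the cap of 10 follow-up draws is an illustrative choice.

```c
#include <stdint.h>

/* Hypothetical stand-in generator (xorshift64). The state must be nonzero.
 * This is NOT the generator used by the scheduler. */
static uint64_t sketch_next_u64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Take the upper 32 bits of the stand-in generator's output. */
static uint32_t sketch_next_u32(uint64_t *state)
{
    return (uint32_t)(sketch_next_u64(state) >> 32);
}

/* Return a value in [0, upper), refining the result with extra random
 * words only when the first word alone cannot decide it. */
static uint32_t sketch_uniform_u32(uint64_t *state, uint32_t upper)
{
    if (upper <= 1)
        return 0;

    uint64_t prod = (uint64_t)upper * sketch_next_u32(state);
    uint32_t i = (uint32_t)(prod >> 32); /* integer part: candidate result */
    uint32_t f = (uint32_t)prod;         /* fractional part */

    /* Later words contribute strictly less than `upper` to f, so if
     * f <= 2^32 - upper no carry can ever reach i. */
    if (f <= (uint32_t)(0u - upper))
        return i;

    for (int j = 0; j < 10; j++) {
        prod = (uint64_t)upper * sketch_next_u32(state);
        uint32_t f2 = (uint32_t)(prod >> 32);
        f += f2;
        if (f < f2)            /* f wrapped around: carry into the result */
            return i + 1;
        if (f != UINT32_MAX)   /* a later carry cannot propagate through f */
            return i;
        f = (uint32_t)prod;    /* still undecided: move to the next word */
    }
    return i; /* remaining bias after the cap is negligible */
}
```

In this sketch the common case costs one multiply and one comparison. The follow-up loop is what removes the bias, and it is also where the data-dependent branches come from.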
C rng
<img width="1011" alt="image"
src="https://github.com/user-attachments/assets/0dd9d5f2-17ef-4a70-b275-1d12692be060">
New scheduler rng
<img width="985" alt="image"
src="https://github.com/user-attachments/assets/4abd0a57-a1d9-46ec-99a5-535f366ecafa">
~On my benchmarks the julia implementation seems to be almost 50% faster
than the current implementation.~
With Oscar's suggestion of removing the debiasing, this is now almost 5x
faster than the original implementation and almost fully branchless.
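
To illustrate why dropping the debiasing makes the mapping branch-free, here is a minimal C sketch of a multiply-and-shift range reduction. It is an illustration under assumptions, not the code in this PR: the `demo_*` names are hypothetical, the xorshift generator is only a stand-in for the scheduler's rng, and the 32-bit word size is chosen for simplicity.

```c
#include <stdint.h>

/* Hypothetical stand-in generator (xorshift64). The state must be nonzero.
 * This is NOT the generator used by the scheduler. */
static uint64_t demo_next_u64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Map 32 random bits onto [0, upper) with one multiply and one shift.
 * There is no rejection loop and no data-dependent branch. */
static uint32_t demo_bounded_u32(uint64_t *state, uint32_t upper)
{
    uint32_t r = (uint32_t)(demo_next_u64(state) >> 32);
    return (uint32_t)(((uint64_t)r * upper) >> 32);
}
```

Without the correction step, each output value is hit by either floor(2^32/upper) or ceil(2^32/upper) of the 2^32 possible inputs, so for uniformly distributed input bits the per-value probabilities differ by at most 2^-32. For small bounds, such as picking among a handful of threads, that bias is tiny.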
We might want to backport the two previous commits since they
technically fix bugs.
---------
Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>