faster `rand(1:n)` by outlining unlikely branch (#58089)
It's hard to measure the improvement with single calls, but this change
substantially improves the situation in #50509, such that these new
versions of `randperm` etc. are almost always faster (even for big n).
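
For readers unfamiliar with the term, here is a minimal sketch of what "outlining the unlikely branch" means. It is not the code from this PR: `bounded_rand` and `bounded_rand_slow` are hypothetical names, and the sketch assumes a Lemire-style multiply-and-shift sampler for a `UInt64` range length `s > 0`. The rarely taken rejection loop is moved into a `@noinline` function so that the hot path stays small.

```julia
using Random

# Hypothetical names; a sketch only, not the implementation in Random.
# Slow path: kept out of line so the caller stays small and cheap to inline.
@noinline function bounded_rand_slow(rng::AbstractRNG, s::UInt64, m::UInt128)
    t = (-s) % s                 # 2^64 mod s, using wrapping negation
    while (m % UInt64) < t       # reject draws that would bias the result
        m = widemul(rand(rng, UInt64), s)
    end
    return (m >> 64) % UInt64
end

# Fast path: uniform draw from 0:s-1, assuming s > 0.
@inline function bounded_rand(rng::AbstractRNG, s::UInt64)
    m = widemul(rand(rng, UInt64), s)    # full 128-bit product
    # If the low word is >= s, the draw is accepted without any division.
    (m % UInt64) < s && return bounded_rand_slow(rng, s, m)
    return (m >> 64) % UInt64
end

bounded_rand(Xoshiro(1), UInt64(100))    # some value in 0:99
```

In this sketch the slow path is entered with probability `s / 2^64`, so it is rare for small ranges and taken almost every time when the range spans nearly all of `UInt64`.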
Here are some example benchmarks. Note that the biggest ranges, like
`UInt(0):UInt(2)^64-2`, are the ones that exercise the "unlikely"
branch the most:
```julia
julia> using Random, Chairmarks; const xx = Xoshiro()
julia> rands(rng, ns) = for i=ns
           rand(rng, zero(i):i)
       end
julia> rands(ns) = for i=ns
           rand(zero(i):i)
       end
julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1)
(1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR
(2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master
julia> @b rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2), rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG
(2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR
(3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master
julia> @b rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2)
(95.315 ns, 132.144 ns, 7.486 ns) # PR
(217.169 ns, 143.519 ns, 8.065 ns) # master
julia> @b rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2) # with TaskLocalRNG
(235.882 ns, 162.809 ns, 10.603 ns) # PR
(202.524 ns, 132.869 ns, 7.631 ns) # master
```
So it's a bit tricky: with an explicit RNG, `rands(xx, 1:100)` becomes
much faster (about 95 ns vs. 217 ns), but without one, `rands(1:100)`
becomes slower (about 236 ns vs. 203 ns).
Assuming #50509 was merged, `shuffle` is a good function to benchmark
`rand(1:n)`, and the changes here consistently improve performance, as
shown by this graph (when `TaskLocalRNG` is mentioned, it means *no* RNG
argument was passed to the function):

So although there can be slowdowns, I think this change is overall a
win.
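
To make concrete why `shuffle` is a good proxy for `rand(1:n)`, here is a textbook Fisher-Yates sketch. It is not the implementation from #50509 or from `Random`; `fisher_yates!` is a hypothetical name and the sketch assumes a 1-based `Vector`. Each iteration makes exactly one `rand(rng, 1:i)` call on a different range, which is the same access pattern as `rands` above.

```julia
using Random

# Hypothetical helper, not Random.shuffle!: in-place Fisher-Yates shuffle.
function fisher_yates!(rng::AbstractRNG, a::Vector)
    for i in length(a):-1:2
        j = rand(rng, 1:i)           # one bounded draw per element
        a[i], a[j] = a[j], a[i]      # swap the chosen element into place
    end
    return a
end

fisher_yates!(Xoshiro(1), collect(1:10))   # a permutation of 1:10
```

Because the range changes on every call, the cost per element is dominated by a single `rand(1:i)`, so a per-call speedup or slowdown shows up directly in the shuffle timing.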