Unroll tuple `allequal` for performance (#61433)
In a similar vein to https://github.com/JuliaLang/julia/pull/61426, we
can speed up `allequal` on tuples by unrolling the loop (up to a cap of
32 elements, chosen by convention).
This is probably not a common bottleneck, but we may as well be faster
where possible.
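For readers who want a picture of the technique, here is a minimal sketch of unrolling by recursion over the tuple, assuming a length cap of 32 as described above. It is not the code from this PR: `allequal_sketch`, `_allequal_unrolled`, and `UNROLL_CAP` are hypothetical names used only for illustration, and the fallback loop for long tuples is an assumption about how the capped case might be handled.

```julia
# Illustrative sketch only; these names are hypothetical, not part of Base.

# Assumed cap, mirroring the "32 by convention" mentioned in the description.
const UNROLL_CAP = 32

# Compare `x` against each remaining element by peeling the tuple one element
# at a time. Because the tuple length is part of its type, the compiler can
# turn this recursion into straight-line code for short tuples.
_allequal_unrolled(x, ::Tuple{}) = true
_allequal_unrolled(x, t::Tuple) =
    isequal(x, first(t)) && _allequal_unrolled(x, Base.tail(t))

function allequal_sketch(t::Tuple)
    isempty(t) && return true
    x = first(t)
    if length(t) <= UNROLL_CAP
        # Short tuple: use the recursive (unrollable) path.
        return _allequal_unrolled(x, Base.tail(t))
    end
    # Long tuple: fall back to an ordinary loop to limit generated code size.
    for i in 2:length(t)
        isequal(x, t[i]) || return false
    end
    return true
end
```

Like `allequal`, the sketch compares elements with `isequal`, so `allequal_sketch((1.0, 1))` and `allequal((1.0, 1))` should agree.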
master:
```
julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 5))
BenchmarkTools.Trial: 10000 samples with 998 evaluations per sample.
Range (min … max): 13.861 ns … 8.303 μs ┊ GC (min … max): 0.00% … 99.17%
Time (median): 18.412 ns ┊ GC (median): 0.00%
Time (mean ± σ): 33.582 ns ± 122.345 ns ┊ GC (mean ± σ): 6.08% ± 1.71%
▅▇█▇▅▂ ▁▄▄▄▃▃▁ ▁▄▅▄▃▃▂▁ ▃▄▄▃▂▁▁▂▂▁ ▁▂▂▁ ▁▁▃▂ ▂
██████▅▅▃▅▃▁▄▅███████▇▆████████▇▇▇███████████████▇▆▆█████▆▇▆ █
13.9 ns Histogram: log(frequency) by time 83.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 12))
BenchmarkTools.Trial: 624 samples with 997 evaluations per sample.
Range (min … max): 16.090 ns … 42.490 μs ┊ GC (min … max): 0.00% … 73.54%
Time (median): 10.193 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.034 μs ± 4.193 μs ┊ GC (mean ± σ): 0.62% ± 2.94%
█ ▆▅▆
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▆▅█▃▂▁▁▁▁▁▁▁▆███▆▃▃ ▃
16.1 ns Histogram: frequency by time 11.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 56))
BenchmarkTools.Trial: 480 samples with 1 evaluation per sample.
Range (min … max): 9.840 ms … 48.062 ms ┊ GC (min … max): 0.00% … 76.38%
Time (median): 10.312 ms ┊ GC (median): 0.00%
Time (mean ± σ): 10.399 ms ± 1.744 ms ┊ GC (mean ± σ): 0.74% ± 3.49%
▁▇ ▁▆▄▁▂▃▂▃▃▆█▆▃▄▂▁▁▁ ▁
▄▄▃▅▇██▇██████████████████▇█▄▄▄▃▂▂▁▃▃▃▂▁▂▂▁▁▁▂▁▁▂▁▂▃▂▁▁▁▁▁▂ ▄
9.84 ms Histogram: frequency by time 11.5 ms <
Memory estimate: 1.45 MiB, allocs estimate: 27954.
```
PR:
```
julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 5))
BenchmarkTools.Trial: 10000 samples with 998 evaluations per sample.
Range (min … max): 14.445 ns … 91.516 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 16.868 ns ┊ GC (median): 0.00%
Time (mean ± σ): 16.809 ns ± 1.603 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▃▁ █▁▁▂▁
▁▂▄▅▄▃▄▄▄▃▂▂▂▁▂▄▇█████▇▅▃▃▂▃▇█████▇▄▃▂▂▃▃▃▄▄▄▅▄▄▃▂▂▁▁▁▁▁▁▁▁ ▃
14.4 ns Histogram: frequency by time 19.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 12))
BenchmarkTools.Trial: 952 samples with 998 evaluations per sample.
Range (min … max): 15.697 ns … 20.862 μs ┊ GC (min … max): 0.00% … 62.59%
Time (median): 6.387 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.256 μs ± 3.257 μs ┊ GC (mean ± σ): 0.48% ± 2.84%
█
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▃▂▂▃▃▃▃▄▄▃▄▄▄▃▃▄▄▅▄▄▃▃▃▄▃▃▃▃▃▂ ▃
15.7 ns Histogram: frequency by time 9.37 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 56))
BenchmarkTools.Trial: 645 samples with 1 evaluation per sample.
Range (min … max): 6.847 ms … 23.438 ms ┊ GC (min … max): 0.00% … 62.03%
Time (median): 7.830 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.730 ms ± 827.062 μs ┊ GC (mean ± σ): 0.29% ± 2.44%
▁▂▃▁ ▅█▄▁▁
▃▇████▆█▇▆▄▄▄▄▃▄▃▃▃▃▃▄▄▄▇█████▇▇▆▇▄▅▄▃▄▅▄▃▄▃▃▄▃▃▃▄▃▃▃▃▂▃▁▃▂ ▄
6.85 ms Histogram: frequency by time 9.08 ms <
Memory estimate: 488.16 KiB, allocs estimate: 9482.
```