improve performance of `powermod` (#59930)
Based on https://github.com/JuliaLang/julia/pull/54866, which this PR closes.
Differences from https://github.com/JuliaLang/julia/pull/54866:
* checks `iszero(r)` in `mod(a, ::SignedMultiplicativeInverse)` for
cases like `mod(0, multiplicativeinverse(-5))`
* catches some previously failing cases where `x` and `m` have mismatched
signedness
* added a very rough heuristic `(p > 2sizeof(mm))` so we don't bother
computing the multiplicative inverse when the power is very small
* switched the multiplication loop from MSB-first to LSB-first
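
For readers unfamiliar with the term, an LSB-first loop consumes the exponent starting from its lowest bit. The following is only a minimal sketch of that idea, not the code in this PR: `powermod_lsb` is a hypothetical name, it assumes `p >= 0` and `m > 0`, and it widens to `Int128` instead of using a multiplicative inverse.

```julia
# Illustrative LSB-first (right-to-left) square-and-multiply.
# `powermod_lsb` is a hypothetical helper, not the Base implementation.
function powermod_lsb(x::Int64, p::Int64, m::Int64)
    p >= 0 || throw(DomainError(p, "this sketch only handles p >= 0"))
    m > 0 || throw(DomainError(m, "this sketch only handles m > 0"))
    mw = Int128(m)
    b = Int128(mod(x, m))   # holds x^(2^k) mod m at step k
    r = Int128(mod(1, m))   # accumulator; mod(1, 1) == 0 covers m == 1
    while p > 0
        if isodd(p)         # lowest remaining bit of the exponent is set
            r = mod(r * b, mw)
        end
        b = mod(b * b, mw)  # square for the next bit; fits in Int128
        p >>= 1             # drop the bit just processed
    end
    return Int64(r)
end
```

An MSB-first loop would instead start from the highest set bit of `p` and square the accumulator at every step.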
benchmark:
```
using BenchmarkTools
function setup()
    x, p, m = rand(NTuple{3, Int})
    # Resample until m is nonzero and, for negative p, x is invertible mod m.
    while iszero(m) || ((p < 0) && !isone(gcd(x, m)))
        x, p, m = rand(NTuple{3, Int})
    end
    return (x, p, m)
end
@benchmark powermod(x, p, m) setup=((x,p,m) = setup())
#master
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
Range (min … max): 1.304 μs … 8.938 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.779 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.803 μs ± 270.848 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▅████▇▆▄▁▁
▁▁▁▁▁▁▁▂▂▃▅▅▇█████████████▆▅▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
1.3 μs Histogram: frequency by time 2.76 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
#PR
julia> @benchmark powermod(x, p, m) setup=((x,p,m) = setup())
BenchmarkTools.Trial: 10000 samples with 135 evaluations per sample.
Range (min … max): 667.896 ns … 1.267 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 854.015 ns ┊ GC (median): 0.00%
Time (mean ± σ): 886.407 ns ± 74.610 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃▆██▅▂
▂▂▁▁▂▂▂▂▂▂▂▂▂▂▃▃▃▄▄▅▆▇████████▆▅▄▃▃▃▃▃▄▄▅▆▆▇▇████▇▆▆▅▄▃▃▃▃▂▂ ▄
668 ns Histogram: frequency by time 1.06 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```
Co-authored-by: Chris Elrod <elrodc@gmail.com>