Scaling loop instead of broadcasting in strided matrix exp (#56463)
Firstly, this is easier to read. Secondly, it merges the two loops
into one. Thirdly, it avoids the latency incurred by broadcasting.
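For illustration only, here is a minimal sketch of the pattern this PR moves towards: an explicit element-wise loop in place of a broadcast over the strided matrix (the function name and the operation are hypothetical, not the actual `exp!` internals):
```julia
# Hypothetical helper, not the actual exp! code: scale a strided matrix
# in place with a plain loop rather than a broadcast like `A .*= s`.
function scale_loop!(A::StridedMatrix{T}, s::T) where {T}
    @inbounds for i in eachindex(A)
        A[i] *= s
    end
    return A
end

A = rand(2, 2)
scale_loop!(A, 0.5)  # same result as A .*= 0.5, without the broadcast machinery
```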
```julia
julia> using LinearAlgebra
julia> A = rand(2,2);
julia> @time LinearAlgebra.exp!(A);
0.952597 seconds (2.35 M allocations: 116.574 MiB, 2.67% gc time, 99.01% compilation time) # master
0.877404 seconds (2.17 M allocations: 106.293 MiB, 2.65% gc time, 99.99% compilation time) # this PR
```
Runtime performance also improves, since there are fewer allocations in the
first branch (`opnorm(A, 1) <= 2.1`):
```julia
julia> using BenchmarkTools
julia> B = diagm(0=>im.*(float.(1:200))./200, 1=>(1:199)./400, -1=>(1:199)./400);
julia> opnorm(B,1)
1.9875
julia> @btime exp($B);
5.066 ms (30 allocations: 4.89 MiB) # nightly v"1.12.0-DEV.1581"
4.926 ms (27 allocations: 4.28 MiB) # this PR
```
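For reference, a quick way to confirm that a test matrix like the one above actually exercises the small-norm branch, and to look at its allocations directly (illustrative only; the `2.1` threshold is the one quoted above):
```julia
using LinearAlgebra

B = diagm(0 => im .* (float.(1:200)) ./ 200,
          1 => (1:199) ./ 400,
         -1 => (1:199) ./ 400)

# exp! takes its first (small-norm) branch when opnorm(A, 1) <= 2.1
@assert opnorm(B, 1) <= 2.1

exp(B)                      # warm up so compilation is not counted
allocs = @allocated exp(B)  # bytes allocated on the small-norm path
println(allocs, " bytes allocated")
```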