Specialize `isbanded` for `StridedMatrix` (#56487)
This improves performance, as the loops in `istriu` and `istril` may be
fused to improve cache locality.
This also changes the quick-return behavior: the function now returns only
after the check over all the upper or lower bands for a column is complete.
```julia
julia> using LinearAlgebra, BenchmarkTools
julia> A = zeros(2, 10_000);
julia> @btime isdiag($A);
32.682 μs (0 allocations: 0 bytes) # nightly v"1.12.0-DEV.1593"
9.481 μs (0 allocations: 0 bytes) # this PR
julia> A = zeros(10_000, 2);
julia> @btime isdiag($A);
10.288 μs (0 allocations: 0 bytes) # nightly
2.579 μs (0 allocations: 0 bytes) # this PR
julia> A = zeros(100, 100);
julia> @btime isdiag($A);
6.616 μs (0 allocations: 0 bytes) # nightly
3.075 μs (0 allocations: 0 bytes) # this PR
julia> A = diagm(0=>1:100); A[3,4] = 1;
julia> @btime isdiag($A);
2.759 μs (0 allocations: 0 bytes) # nightly
85.371 ns (0 allocations: 0 bytes) # this PR
```
A similar change is made to `istriu`/`istril` as well, which gives the
following timings (the trivial cases are a few nanoseconds slower, the
rest are faster):
```julia
julia> A = zeros(2, 10_000);
julia> @btime istriu($A); # trivial
7.358 ns (0 allocations: 0 bytes) # nightly
13.779 ns (0 allocations: 0 bytes) # this PR
julia> @btime istril($A);
33.464 μs (0 allocations: 0 bytes) # nightly
9.476 μs (0 allocations: 0 bytes) # this PR
julia> A = zeros(10_000, 2);
julia> @btime istriu($A);
10.020 μs (0 allocations: 0 bytes) # nightly
2.620 μs (0 allocations: 0 bytes) # this PR
julia> @btime istril($A); # trivial
6.793 ns (0 allocations: 0 bytes) # nightly
14.473 ns (0 allocations: 0 bytes) # this PR
julia> A = zeros(100, 100);
julia> @btime istriu($A);
3.435 μs (0 allocations: 0 bytes) # nightly
1.637 μs (0 allocations: 0 bytes) # this PR
julia> @btime istril($A);
3.353 μs (0 allocations: 0 bytes) # nightly
1.661 μs (0 allocations: 0 bytes) # this PR
```
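As a rough illustration of the fused, column-wise check described above, here is a minimal sketch. It is not the code added in this PR: the name `isbanded_fused_sketch` is hypothetical, it assumes one-based indexing, and it reads the quick-return note as "decide once per column", which is an assumption about the intent.

```julia
using LinearAlgebra

# Hypothetical helper for illustration only; not the method added in this PR.
# Returns true if A[i, j] == 0 whenever j - i lies outside kl:ku.
# Assumes one-based indexing.
function isbanded_fused_sketch(A::AbstractMatrix, kl::Integer, ku::Integer)
    m, n = size(A)
    for j in 1:n
        ok = true
        # Rows above the ku-th band in column j (empty range if there are none).
        for i in 1:min(j - ku - 1, m)
            ok &= iszero(A[i, j])
        end
        # Rows below the kl-th band in column j (empty range if there are none).
        for i in max(j - kl + 1, 1):m
            ok &= iszero(A[i, j])
        end
        # Return only once the whole column has been checked.
        ok || return false
    end
    return true
end
```

Both parts of a column are visited in one pass over contiguous memory, instead of one full traversal for the upper triangle and another for the lower one. With `kl = ku = 0` the sketch plays the role of `isdiag`.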
---------
Co-authored-by: Daniel Karrasch <daniel.karrasch@posteo.de>