Reduce matmul latency by splitting small matmul (#54421)
This splits the `matmul2x2` and `matmul3x3` methods into components that depend
on `MulAddMul` and components that don't. This improves
compilation time, as the `MulAddMul`-independent methods won't need to
be recompiled in the `@stable_muladdmul` branches.
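As a rough illustration of the pattern, here is a minimal sketch with hypothetical names, using plain `alpha`/`beta` arguments in place of the actual `MulAddMul`-based internals:
```julia
# Hypothetical sketch of the split; not the actual LinearAlgebra code.
# The arithmetic-heavy helper below does not depend on the scaling
# arguments, so it is compiled only once per element type.
function _matmul2x2_elements(A, B)
    (A[1,1]*B[1,1] + A[1,2]*B[2,1], A[2,1]*B[1,1] + A[2,2]*B[2,1],
     A[1,1]*B[1,2] + A[1,2]*B[2,2], A[2,1]*B[1,2] + A[2,2]*B[2,2])
end

# Only this thin wrapper is recompiled per alpha/beta combination
# (in the real code, per `MulAddMul` instance).
function _matmul2x2!(C, A, B, alpha, beta)
    P11, P21, P12, P22 = _matmul2x2_elements(A, B)
    C[1,1] = P11*alpha + C[1,1]*beta
    C[2,1] = P21*alpha + C[2,1]*beta
    C[1,2] = P12*alpha + C[1,2]*beta
    C[2,2] = P22*alpha + C[2,2]*beta
    return C
end
```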
TTFX (time to first execution; each call timed in a separate session):
```julia
julia> using LinearAlgebra
julia> A = rand(2,2); B = Symmetric(rand(2,2)); C = zeros(2,2);
julia> @time mul!(C, A, B);
1.927468 seconds (5.67 M allocations: 282.523 MiB, 12.09% gc time, 100.00% compilation time) # nightly v"1.12.0-DEV.492"
1.282717 seconds (4.46 M allocations: 228.816 MiB, 4.58% gc time, 100.00% compilation time) # This PR
julia> A = rand(2,2); B = rand(2,2); C = zeros(2,2);
julia> @time mul!(C, A, B);
1.653368 seconds (5.75 M allocations: 291.586 MiB, 13.94% gc time, 100.00% compilation time) # nightly
1.148330 seconds (4.46 M allocations: 230.714 MiB, 4.47% gc time, 100.00% compilation time) # This PR
```
Edit: Not inlining the function seems to incur a runtime performance
cost.
```julia
julia> using LinearAlgebra, BenchmarkTools # BenchmarkTools provides @btime
julia> A = rand(3,3); B = rand(size(A)...); C = zeros(size(A));
julia> @btime mul!($C, $A, $B);
23.923 ns (0 allocations: 0 bytes) # nightly
31.732 ns (0 allocations: 0 bytes) # This PR
```
Adding `@inline` annotations resolves this difference, but
reintroduces the compilation latency. The tradeoff is perhaps acceptable, as
users with performance-critical small matrix
multiplications may reach for `StaticArrays` instead, as in the sketch below.
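For reference, the `StaticArrays` route for such small, fixed-size products looks like this (illustrative usage only; no timings claimed):
```julia
julia> using StaticArrays

julia> A = @SMatrix rand(3,3); B = @SMatrix rand(3,3);

julia> A * B; # fully unrolled, stack-allocated 3x3 multiply
```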