50063122 - Reduce matmul latency by splitting small matmul (#54421)

This splits the `matmul2x2` and `matmul3x3` methods into components that depend on `MulAddMul` and those that don't. This improves compilation time, as the `MulAddMul`-independent methods won't need to be recompiled in the `@stable_muladdmul` branches.

TTFX (each call timed in a separate session):
```julia
julia> using LinearAlgebra

julia> A = rand(2,2); B = Symmetric(rand(2,2)); C = zeros(2,2);

julia> @time mul!(C, A, B);
  1.927468 seconds (5.67 M allocations: 282.523 MiB, 12.09% gc time, 100.00% compilation time) # nightly v"1.12.0-DEV.492"
  1.282717 seconds (4.46 M allocations: 228.816 MiB, 4.58% gc time, 100.00% compilation time)  # This PR

julia> A = rand(2,2); B = rand(2,2); C = zeros(2,2);

julia> @time mul!(C, A, B);
  1.653368 seconds (5.75 M allocations: 291.586 MiB, 13.94% gc time, 100.00% compilation time) # nightly
  1.148330 seconds (4.46 M allocations: 230.714 MiB, 4.47% gc time, 100.00% compilation time)  # This PR
```

Edit: Not inlining the function seems to incur a runtime performance cost.
```julia
julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(3,3); B = rand(size(A)...); C = zeros(size(A));

julia> @btime mul!($C, $A, $B);
  23.923 ns (0 allocations: 0 bytes) # nightly
  31.732 ns (0 allocations: 0 bytes) # This PR
```

Adding `@inline` annotations resolves this difference, but that reintroduces the compilation latency. The tradeoff is perhaps acceptable, as users may use `StaticArrays` for performance-critical small-matrix multiplications.
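
For illustration, a minimal sketch of the splitting idea (hypothetical names, not the actual code from this PR): the coefficient-independent 2x2 product lives in its own function, so only the thin α/β wrapper needs to be recompiled for the different `MulAddMul` combinations.

```julia
# Inner kernel: depends only on the element types of A and B, not on α/β,
# so it compiles once per element type. (Hypothetical name, not the PR's code.)
function _matmul2x2_core(A, B)
    a11, a21, a12, a22 = A[1,1], A[2,1], A[1,2], A[2,2]
    b11, b21, b12, b22 = B[1,1], B[2,1], B[1,2], B[2,2]
    return (a11*b11 + a12*b21, a21*b11 + a22*b21,
            a11*b12 + a12*b22, a21*b12 + a22*b22)
end

# Outer wrapper: applies C = α*(A*B) + β*C; only this thin part varies with
# the α/β (MulAddMul) combination.
function matmul2x2_sketch!(C, A, B, α=true, β=false)
    p11, p21, p12, p22 = _matmul2x2_core(A, B)
    C[1,1] = α*p11 + β*C[1,1]
    C[2,1] = α*p21 + β*C[2,1]
    C[1,2] = α*p12 + β*C[1,2]
    C[2,2] = α*p22 + β*C[2,2]
    return C
end
```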
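
As a usage note, the `StaticArrays` route mentioned above looks like the following (assuming the StaticArrays.jl package is installed); static matrices dispatch to fully unrolled, allocation-free multiplication kernels rather than the generic small-matmul paths in LinearAlgebra.

```julia
julia> using StaticArrays  # assumes StaticArrays.jl is installed

julia> A = @SMatrix rand(3,3); B = @SMatrix rand(3,3);

julia> C = A * B;  # unrolled, stack-allocated 3x3 multiply
```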