[LMI] Support non-power-of-2 types for the matmul remainder (#163987)
In the inner loop of matmul, instead of continuously halving the HW
vector register width, I just use the remainder vector directly if it's
legal.
We don't have in-tree targets that have this so I opted for adding a
hidden flag to simulate this for testing purposes:
-matrix-split-matmul-remainder-over-threshold
The tests are the vectorization-friendly 3x3x1 matrix-vector and 1x3x3
vector-matrix multiplies for CM, RM respectively.