[inductor] Lower small gemvs on CPU (#110456)
If the gemv fits in registers, e.g. [1,16]*[16,16], MKL isn't going to
do much better than a compiled simple for-loop, and calling out to it
means paying allocation overhead and ATen overhead on top.
A very small internal inference model drops from 7 us to 5 us with this change.
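
For illustration, here is a minimal sketch of the kind of plain loop such a small gemv can reduce to on CPU. This is not the actual Inductor-generated kernel; the function name, fixed sizes, and row-major weight layout are assumptions for the example:

```cpp
#include <cstddef>

// Hypothetical hand-written equivalent of a [1,16] x [16,16] gemv:
// everything fits in registers, so there is no library call, no dispatch,
// and no intermediate allocation.
void small_gemv_1x16x16(const float* x,   // [16]    input vector
                        const float* w,   // [16,16] weight, row-major
                        float* out) {     // [16]    output vector
    for (std::size_t j = 0; j < 16; ++j) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < 16; ++k) {
            acc += x[k] * w[k * 16 + j];
        }
        out[j] = acc;
    }
}
```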
Differential Revision: [D49875991](https://our.internmc.facebook.com/intern/diff/D49875991/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110456
Approved by: https://github.com/chenyang78, https://github.com/jgong5