qgemm: optimize avxvnni QGEMM inner kernel for M=1 (#22952)
Add a specialized path for the M=1 case that uses the additional ymm
registers available in that case to unroll the inner kernel loop more deeply.
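The idea can be illustrated with a portable scalar sketch. This is not the kernel from this change (which is written with AVX-VNNI instructions); it only shows, under assumed names (`qgemv_reference`, `qgemv_unrolled`, `kUnroll`) and an assumed row-major K x N layout for B, how several independent accumulators let the K loop be unrolled without changing the result:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference M=1 product: c[n] = sum over k of a[k] * b[k * n_cols + n],
// with unsigned 8-bit A, signed 8-bit B and 32-bit accumulation.
std::vector<int32_t> qgemv_reference(const std::vector<uint8_t>& a,
                                     const std::vector<int8_t>& b,
                                     std::size_t n_cols) {
    std::vector<int32_t> c(n_cols, 0);
    for (std::size_t k = 0; k < a.size(); ++k)
        for (std::size_t n = 0; n < n_cols; ++n)
            c[n] += int32_t(a[k]) * int32_t(b[k * n_cols + n]);
    return c;
}

// Hypothetical unroll factor; each accumulator stands in for one spare register.
constexpr std::size_t kUnroll = 4;

// Same product, with the K loop unrolled over kUnroll independent accumulators.
std::vector<int32_t> qgemv_unrolled(const std::vector<uint8_t>& a,
                                    const std::vector<int8_t>& b,
                                    std::size_t n_cols) {
    const std::size_t k_len = a.size();
    std::vector<int32_t> c(n_cols, 0);
    for (std::size_t n = 0; n < n_cols; ++n) {
        int32_t acc[kUnroll] = {0, 0, 0, 0};
        std::size_t k = 0;
        // Main loop: kUnroll steps of K per iteration, no dependency between them.
        for (; k + kUnroll <= k_len; k += kUnroll)
            for (std::size_t u = 0; u < kUnroll; ++u)
                acc[u] += int32_t(a[k + u]) * int32_t(b[(k + u) * n_cols + n]);
        int32_t sum = acc[0] + acc[1] + acc[2] + acc[3];
        // Tail: remaining K steps when k_len is not a multiple of kUnroll.
        for (; k < k_len; ++k)
            sum += int32_t(a[k]) * int32_t(b[k * n_cols + n]);
        c[n] = sum;
    }
    return c;
}
```

Because integer addition is associative, both functions return identical results as long as the sums fit in 32 bits. The independent accumulators shorten the dependency chain through the accumulate step, which is the usual motivation for this kind of unrolling.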
Performance impact (measured on 13th Gen Intel(R) Core(TM) i9-13900K):
- About 29% lower run time (geometric mean; 27.5% to 30.0% per benchmark)
for single-threaded QGEMM kernels with M=1
- 7% reduction in average inference time on a small quantized model where
all kernels have M=1
```
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| Benchmark                                                          |   Time |     CPU | Time Old | Time New | CPU Old | CPU New |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time   | -0.275 | -0.2756 |     4330 |     3137 |    4330 |    3136 |
| QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time  | -0.292 | -0.2927 |     9027 |     6385 |    9027 |    6385 |
| QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time | -0.300 | -0.3005 |    17867 |    12499 |   17866 |   12498 |
| OVERALL_GEOMEAN                                                    | -0.289 | -0.2897 |          |          |         |         |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
```
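For readers checking the numbers, the sketch below recomputes the relative changes from the Time Old and Time New columns. It assumes that the Time column is (new - old) / old and that OVERALL_GEOMEAN is the geometric mean of the new/old ratios minus one; the names `Timing`, `relative_change` and `geomean_change` are made up for this illustration.

```cpp
#include <cmath>
#include <vector>

// One table row: old and new timings in the units printed by the benchmark.
struct Timing {
    double old_time;
    double new_time;
};

// Relative change for one row; negative values mean the new code is faster.
double relative_change(const Timing& t) {
    return (t.new_time - t.old_time) / t.old_time;
}

// Geometric mean of the new/old ratios, expressed as a relative change.
// Returns 0.0 for an empty input to avoid dividing by zero.
double geomean_change(const std::vector<Timing>& rows) {
    if (rows.empty()) return 0.0;
    double log_sum = 0.0;
    for (const Timing& t : rows)
        log_sum += std::log(t.new_time / t.old_time);
    return std::exp(log_sum / static_cast<double>(rows.size())) - 1.0;
}
```

Calling `geomean_change` on the three rows above, `{4330.0, 3137.0}`, `{9027.0, 6385.0}` and `{17867.0, 12499.0}`, gives roughly -0.29, in line with the OVERALL_GEOMEAN row. Small differences in the last digit are expected because the printed timings are rounded.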
---------
Co-authored-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>