[x86] matmulnbit x64 kernel for 8bits (#24491)
### Description
Add 8-bit support for MatMulNBits on x86.
__AVX512 VNNI__
| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 34145 | 27723 | **1.23×** |
| 1 | 11008 | 4096 | 415285 | 68656 | **6.05×** |
| 1 | 4096 | 11008 | 407801 | 68061 | **5.99×** |
| 1 | 11008 | 11008 | 2674538 | 1003532 | **2.67×** |
| 4096 | 4096 | 4096 | 80338759 | 86321713 | **0.93×** |
| 4096 | 11008 | 4096 | 213421935 | 225245276 | **0.95×** |
| 4096 | 4096 | 11008 | 240164365 | 228966953 | **1.05×** |
| 4096 | 11008 | 11008 | 628352046 | 596738340 | **1.05×** |
__AVX512__
| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 53324 | 37882 | **1.41×** |
| 1 | 11008 | 4096 | 244560 | 103255 | **2.37×** |
| 1 | 4096 | 11008 | 435131 | 95734 | **4.55×** |
| 1 | 11008 | 11008 | 2790710 | 1075216 | **2.60×** |
| 4096 | 4096 | 4096 | 200629000 | 132841540 | **1.51×** |
| 4096 | 11008 | 4096 | 532141914 | 350613184 | **1.52×** |
| 4096 | 4096 | 11008 | 544011977 | 351679619 | **1.55×** |
| 4096 | 11008 | 11008 | 1421865147 | 925593210 | **1.54×** |
Token generation (M = 1) is bottlenecked by memory access. The 8-bit
model's weights are 2x the size of the 4-bit model's, which is the major
reason for the token generation slowdown.
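
As a rough illustration of the size argument, the sketch below computes the packed weight footprint for the N x K shapes in the tables. It is a back-of-the-envelope estimate only: `weight_bytes` is a hypothetical helper, and it ignores scales, zero points, and any packing or padding the kernel actually uses.

```python
def weight_bytes(n: int, k: int, bits: int) -> int:
    """Bytes for an N x K quantized weight matrix (assumption: densely
    packed, no scales, zero points, or padding counted)."""
    return n * k * bits // 8


if __name__ == "__main__":
    # Shapes taken from the benchmark tables above.
    for n, k in [(4096, 4096), (11008, 4096), (4096, 11008), (11008, 11008)]:
        b4 = weight_bytes(n, k, 4)
        b8 = weight_bytes(n, k, 8)
        print(f"N={n:5d} K={k:5d}  4-bit: {b4 / 2**20:7.1f} MiB  "
              f"8-bit: {b8 / 2**20:7.1f} MiB  ratio: {b8 / b4:.1f}x")
```

With M = 1 each weight is read once per token, so memory traffic grows with this footprint. The 2x figure covers weight bytes only; the measured ratios in the tables also depend on factors this estimate does not model.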
On non-VNNI platforms, an i16 intermediate cannot hold the sum of 4 i8
products, so extra instructions are needed to avoid overflow. This is the
major reason for the non-VNNI slowdown.
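
The overflow concern can be illustrated with worst-case arithmetic. The sketch below assumes the non-VNNI path sums pairs of unsigned-weight x signed-int8-activation products into an i16 lane, as `vpmaddubsw` does. That instruction choice and the operand signedness are assumptions for illustration, not something this description states. `worst_case_pair_sum` and `fits_in_i16` are hypothetical helpers.

```python
I16_MAX = 2**15 - 1  # 32767, largest value a signed 16-bit lane can hold


def worst_case_pair_sum(weight_bits: int, act_max_abs: int = 127) -> int:
    """Largest magnitude of the sum of two (unsigned weight x int8
    activation) products, assuming weights span [0, 2**bits - 1]."""
    max_weight = 2**weight_bits - 1
    return 2 * max_weight * act_max_abs


def fits_in_i16(value: int) -> bool:
    """True if the magnitude fits in a signed 16-bit lane."""
    return value <= I16_MAX


if __name__ == "__main__":
    for bits in (4, 8):
        worst = worst_case_pair_sum(bits)
        print(f"{bits}-bit weights: worst-case pair sum = {worst}, "
              f"fits in i16: {fits_in_i16(worst)}")
```

Under these assumptions the 4-bit worst case (3810) fits in i16, while the 8-bit worst case (64770) does not. VNNI's `vpdpbusd` accumulates four such products directly into i32, which avoids the issue.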
### Motivation and Context
The MatMul4Bits model has a repetition issue. The 8-bit model resolves this issue.