metal : Q3_K speedup (#2995)
* Slightly faster Q3_K and Q5_K on metal
* Another Q3_K speedup on metal
Combined with previous commit, we are now +9.6% for TG.
PP is not affected as this happens via the matrix multiplication
templates.
* Slowly progressing on Q3_K on metal
We are now 13% faster than master
* nother small improvement for Q3_K on metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>