metal : add Q4_K implementation (#1733)

Commit

2 years ago

metal : add Q4_K implementation (#1733)

* Metal implementation for Q4_K

  Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU.

* Optimizing Q4_K on metal

  The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time.

  At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough.

* Optimizing q4_K metal dot some more

  For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0.

* Fix after merge with master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
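To put the three timing pairs from the message side by side, here is a minimal sketch that computes how much slower Q4_K is than Q4_0 at each stage, plus the equivalent tokens per second. Only the ms/token figures come from the commit message; the stage labels and the helper names `overhead_percent` and `tokens_per_second` are assumptions made for illustration and are not part of llama.cpp.

```python
# Timings in ms/token as quoted in the commit message: (label, Q4_K, Q4_0).
# The labels are illustrative, not taken from the repository.
timings = [
    ("initial Metal port", 42.0, 28.0),
    ("after first optimization (n = 128)", 29.5, 27.2),
    ("after second optimization (n = 256)", 28.1, 27.0),
]


def overhead_percent(q4_k_ms: float, q4_0_ms: float) -> float:
    """Relative slowdown of Q4_K versus Q4_0, in percent."""
    return (q4_k_ms / q4_0_ms - 1.0) * 100.0


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


for label, q4_k, q4_0 in timings:
    print(
        f"{label}: Q4_K is {overhead_percent(q4_k, q4_0):.1f}% slower "
        f"({tokens_per_second(q4_k):.1f} vs {tokens_per_second(q4_0):.1f} tokens/s)"
    )
```

Note that the three measurements use different token counts (the first is unspecified, then n = 128 and n = 256), so the percentages are only roughly comparable across stages.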

References

#1733 - Q4_K implementation for Metal

Author

ikawrakow

Parents

00358582

llama.cpp 4161bdc0 - metal : add Q4_K implementation (#1733)
