llama.cpp
3d011226 - CUDA : faster k-quant dot kernels (#1862)

Commit

2 years ago

CUDA : faster k-quant dot kernels (#1862)

* cuda : faster k-quant dot kernels

* Improve Q2_K dot kernel on older GPUs

  We now have a K_QUANTS_PER_ITERATION macro, which should be set to 1 on older GPUs and to 2 on newer GPUs. With this, we preserve the performance of the original PR on RTX-4080 and are faster than master on GTX-1660.

* Improve Q6_K dot kernel on older GPUs

  Using the same K_QUANTS_PER_ITERATION macro as in the last commit, we preserve performance on RTX-4080 and speed up Q6_K on a GTX-1660.

* Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile

  Allowed values are 1 or 2. 2 gives the best performance on modern GPUs and is set as default. On older GPUs 1 may work better.

* PR comments

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
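The compile-time switch described in the message can be illustrated with a minimal host-side C++ sketch. It is not the kernel code from this commit: the function `dot_per_iteration` is a hypothetical stand-in that works on plain floats, and the assumption that the LLAMA_CUDA_KQUANTS_ITER build option ends up defining K_QUANTS_PER_ITERATION for the compiler is inferred from the message rather than shown in it. Only the macro name, the allowed values (1 or 2), and the default of 2 come from the commit message.

```cpp
#include <cstddef>
#include <vector>

// Default of 2 as stated in the commit message. Assumption: a build system
// could override this by passing -DK_QUANTS_PER_ITERATION=1 to the compiler.
#ifndef K_QUANTS_PER_ITERATION
#define K_QUANTS_PER_ITERATION 2
#endif

// The commit message allows only 1 or 2, so reject anything else at compile time.
static_assert(K_QUANTS_PER_ITERATION == 1 || K_QUANTS_PER_ITERATION == 2,
              "K_QUANTS_PER_ITERATION must be 1 or 2");

// Hypothetical illustration: a float dot product that consumes
// K_QUANTS_PER_ITERATION elements per step of the outer loop.
float dot_per_iteration(const std::vector<float>& x, const std::vector<float>& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    float sum = 0.0f;
    std::size_t i = 0;
    // Main loop: one group of K_QUANTS_PER_ITERATION elements per iteration.
    for (; i + K_QUANTS_PER_ITERATION <= n; i += K_QUANTS_PER_ITERATION) {
        for (std::size_t k = 0; k < K_QUANTS_PER_ITERATION; ++k) {
            sum += x[i + k] * y[i + k];
        }
    }
    // Tail: leftover elements when n is not a multiple of the group size.
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The result is the same for either setting; only the amount of work done per loop step changes, which is the kind of knob the message says behaves differently on older and newer GPUs.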

References

#1862 - CUDA : faster k-quant dot kernels

Author

ikawrakow

Parents

602c7488
