Nice work making MMQ so fast!
Are IQ quants supported by the recent speedups? If not, perhaps it's possible to still use cuBLAS for these by default, as many people like to use IQ quants.
Only legacy quants and K-quants have an MMQ implementation at all. For all other data formats cuBLAS is the only option available and there is no change.
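For illustration, the format check amounts to something like the following minimal sketch (the helper name `mmq_supported_type` and the exact list of types are assumptions for illustration, not the actual llama.cpp code):

```cpp
#include "ggml.h" // for ggml_type / GGML_TYPE_* (assuming the llama.cpp source tree)

// Sketch: true only for formats that have an MMQ kernel. Legacy quants
// (Q4_0/Q4_1/Q5_0/Q5_1/Q8_0) and K-quants are covered; IQ quants fall
// through to the default and stay on cuBLAS.
static bool mmq_supported_type(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_Q5_0:
        case GGML_TYPE_Q5_1:
        case GGML_TYPE_Q8_0:
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
        case GGML_TYPE_Q4_K:
        case GGML_TYPE_Q5_K:
        case GGML_TYPE_Q6_K:
            return true;
        default: // IQ quants and everything else: no MMQ kernel
            return false;
    }
}
```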
Would it be possible to have a command-line argument to choose MMQ or cuBLAS, as long as the corresponding architectures are compiled? It'd be great for simplicity of choice, and also for downstream implementations like KoboldCPP.
Also, it would help to mention in the CMakeLists which arch is compatible with / fastest for each NVIDIA chip generation since Kepler/Maxwell.
And, to make it clear for the layperson: does MMVQ automatically trigger if MMQ mode is on and the BLAS batch size is <= 8?
In what cases would you want to use cuBLAS? Command line options have to go through llama.cpp, which requires changes to the llama.cpp API, and then they have to be passed to the backend, which requires adding more exceptions for some backends. They should not be added unless there is a very good reason to do so.
It could maybe be done via environment variables instead which would require no changes to the CLI. But with the current structure where the choice is made at compile time you can skip some kernel variants that you know will never be used so there would be an increase in compilation time and binary size if you were to make it dynamic.
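A minimal sketch of what such an override might look like, assuming a hypothetical `GGML_CUDA_FORCE_CUBLAS` environment variable (both the name and the check are illustrative; the PR itself makes this choice at compile time):

```cpp
#include <cstdlib>

// Sketch: hypothetical runtime override. Returns true if the user set the
// environment variable to anything other than empty or "0".
static bool cuda_force_cublas_from_env(void) {
    const char * val = std::getenv("GGML_CUDA_FORCE_CUBLAS");
    return val != nullptr && val[0] != '\0' && val[0] != '0';
}
```

The catch, as noted above, is that all kernel variants would then have to be compiled unconditionally so that either path can be taken at runtime, which is where the extra compilation time and binary size come from.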
@slaren: in case MMQ doesn't work or performs badly for some reason, cuBLAS on the other hand might work; that's my simple "user-based" thinking. If everything is always optimal by default as long as the proper architectures are compiled, then my request is irrelevant, but is that always the case?
That being said, I understand your argument well enough, and why it takes precedence.
@JohannesGaessler That would be great, especially if it's much simpler to implement and maintain. Compilation time or binary size doesn't bother me, as long as the resulting binaries offer maximum flexibility to end users with an existing but even more modest tech literacy than my own.
An environment variable would be much less intrusive, but I don't think it is a good idea to add more environment variables as a preventive measure.
This PR makes it so that by default `mul_mat_q` instead of FP16 cuBLAS GEMM is used unless the `__dp4a` instruction is unavailable (P100 or older). Performance comparisons can be found in #8062. To make the new kernels actually available I added compute capability 7.5 to CMake. I added a new compilation option `LLAMA_CUDA_FORCE_CUBLAS` with which cuBLAS is always used. I moved code from `common.cuh` to more specialized headers (which is unproblematic because `ggml-cuda.cu` includes them all). I refactored the logic of `ggml_cuda_mul_mat` and moved the MMQ selection logic to a function `ggml_cuda_should_use_mmq`.
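For readers who want the gist of that selection logic, here is a minimal sketch of the decision described above, reusing the hypothetical `mmq_supported_type()` helper sketched earlier. It assumes the CMake option is forwarded as a preprocessor define of the same name and that `__dp4a` requires compute capability 6.1; the real `ggml_cuda_should_use_mmq` has more conditions than this:

```cpp
// Sketch of the MMQ-vs-cuBLAS decision; cc is the device compute capability.
static bool ggml_cuda_should_use_mmq_sketch(ggml_type type, int cc) {
#ifdef LLAMA_CUDA_FORCE_CUBLAS
    return false; // cuBLAS forced at compile time
#else
    if (!mmq_supported_type(type)) {
        return false; // no MMQ kernel for this format (e.g. IQ quants)
    }
    if (cc < 610) {
        return false; // __dp4a unavailable (P100 or older), fall back to cuBLAS
    }
    return true;
#endif
}
```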