llama.cpp
CUDA: use MMQ instead of cuBLAS by default
#8075
Merged


JohannesGaessler commented 332 days ago

This PR makes mul_mat_q (MMQ) the default instead of FP16 cuBLAS GEMM, unless

  • there is no int8 tensor core MMQ implementation available but FP16 tensor cores are (V100, RDNA3), or
  • the __dp4a instruction is unavailable (P100 or older).

Performance comparisons can be found in #8062 . Other changes in this PR:

  • To make the new kernels actually available, I added compute capability 7.5 to CMake.
  • I added a new compilation option LLAMA_CUDA_FORCE_CUBLAS with which cuBLAS is always used.
  • I moved code from common.cuh to more specialized headers (which is unproblematic because ggml-cuda.cu includes them all).
  • I refactored the logic of ggml_cuda_mul_mat and moved the MMQ selection logic to a function ggml_cuda_should_use_mmq.
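A rough sketch of the selection policy described above; the capability flags and the macro spelling are illustrative, the actual logic lives in ggml_cuda_should_use_mmq in ggml-cuda.cu:

```cpp
// Sketch of the default-selection policy (assumption: the tensor core
// capability flags are passed in; the real code derives them from the
// device's compute capability).
static bool should_use_mmq(int cc, bool int8_tc_mmq_available, bool fp16_tc_available) {
    const int MIN_CC_DP4A = 610; // __dp4a requires compute capability >= 6.1

#ifdef LLAMA_CUDA_FORCE_CUBLAS   // compile-time override added by this PR
    return false;
#endif
    if (cc < MIN_CC_DP4A) {
        return false; // no __dp4a (P100 or older): fall back to cuBLAS
    }
    if (!int8_tc_mmq_available && fp16_tc_available) {
        return false; // e.g. V100, RDNA3: FP16 tensor cores but no int8 MMQ path
    }
    return true; // otherwise MMQ is the default
}
```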

github-actions added the build, Nvidia GPU, and ggml labels
Dampfinchen commented 332 days ago

Nice work making MMQ so fast!

Are IQ quants supported by the recent speedups? If not, perhaps it's possible to still use cuBLAS for these by default, as many people like to use IQ quants.

JohannesGaessler commented 332 days ago

Only legacy quants and K-quants have an MMQ implementation at all. For all other data formats cuBLAS is the only option available, and there is no change.
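For reference, this is roughly the kind of type gate that implies, using the real ggml type enums (the exact list in ggml-cuda.cu may differ):

```cpp
#include "ggml.h"

// Sketch: legacy quants and K-quants have MMQ kernels; everything else
// (IQ quants, FP16, ...) keeps using the cuBLAS path.
static bool type_has_mmq(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0: case GGML_TYPE_Q4_1:   // legacy quants
        case GGML_TYPE_Q5_0: case GGML_TYPE_Q5_1:
        case GGML_TYPE_Q8_0:
        case GGML_TYPE_Q2_K: case GGML_TYPE_Q3_K:   // K-quants
        case GGML_TYPE_Q4_K: case GGML_TYPE_Q5_K:
        case GGML_TYPE_Q6_K:
            return true;
        default:
            return false;
    }
}
```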

Nexesenex commented 332 days ago (edited)

Would it be possible to have a command line argument to choose MMQ or cuBLAS, as long as the corresponding architectures are compiled? It'd be great for simplicity of choice, and also for downstream implementations like KoboldCPP.
It would also help to mention in the CMakeLists which arch is compatible with, and fastest for, each NVIDIA chip generation since Kepler/Maxwell.
And, to make it clear for the uninitiated: does mmvq automatically trigger if MMQ mode is on and the batch size is <= 8?

slaren commented 332 days ago

In what cases would you want to use cuBLAS? Command line options have to go through llama.cpp, which requires changes to the llama.cpp API, and then they have to be passed to the backend, which requires adding more exceptions for some backends. They should not be added unless there is a very good reason to do so.

JohannesGaessler commented 332 days ago

It could maybe be done via environment variables instead, which would require no changes to the CLI. But with the current structure, where the choice is made at compile time, you can skip some kernel variants that you know will never be used; making the choice dynamic would therefore increase compilation time and binary size.
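A minimal sketch of what the environment-variable approach could look like (the variable name GGML_CUDA_FORCE_CUBLAS is hypothetical here, not something this PR adds):

```cpp
#include <cstdlib>

// Hypothetical runtime override, read once at backend init; this illustrates
// the idea discussed above rather than any code in the PR.
static bool force_cublas_from_env() {
    const char * v = std::getenv("GGML_CUDA_FORCE_CUBLAS"); // name is made up
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}
```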

Nexesenex commented 332 days ago

@slaren : in case MMQ doesn't work or performs badly for some reason, cuBLAS might do better; that's my simple "user-based" thinking. If everything is always optimal by default as long as the proper architectures are compiled, then my request is irrelevant, but is that always the case?

That being said, I understand your argument well enough and why it takes precedence.

@JohannesGaessler That would be great, especially if it's much simpler to implement and maintain. Compilation time or binary size doesn't bother me, as long as the resulting binaries offer the maximum amount of flexibility to end users with even more modest technical literacy than my own.

slaren commented 332 days ago

An environment variable would be much less intrusive, but I don't think it is a good idea to add more environment variables as a preventive measure.

slaren commented on 2024-06-24 (conversation resolved)
ggml-cuda.cu
const int cc = ggml_cuda_info().devices[id].cc;
use_mul_mat_vec_q = use_mul_mat_vec_q && cc >= MIN_CC_DP4A;
use_mul_mat_q = use_mul_mat_q && ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[1]);

any_gpus_with_slow_fp16 = any_gpus_with_slow_fp16 && !fast_fp16_available(cc);
slaren commented 331 days ago

This is always false, did you mean to use ||?

JohannesGaessler commented 331 days ago

Thanks, you are correct, this should be ||. I only tested with a single GPU and didn't notice.
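For context, the flag is accumulated across devices, so with && and a false initial value it can never become true. A sketch of the fixed accumulation (loop structure illustrative, helpers as in the diff above):

```cpp
bool any_gpus_with_slow_fp16 = false;
for (int id = 0; id < ggml_backend_cuda_get_device_count(); ++id) {
    const int cc = ggml_cuda_info().devices[id].cc;
    // '&&' would keep the flag permanently false; '||' records whether
    // any participating GPU lacks fast FP16 support.
    any_gpus_with_slow_fp16 = any_gpus_with_slow_fp16 || !fast_fp16_available(cc);
}
```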

JohannesGaessler force-pushed from cc029763 to 5479853c 331 days ago
mofosyne added the Review Complexity : Medium label
slaren approved these changes on 2024-06-24
JohannesGaessler: CUDA: use MMQ instead of cuBLAS by default (commit 61f3cb6e)
JohannesGaessler force-pushed from 5479853c to 61f3cb6e 331 days ago
JohannesGaessler merged a818f302 into master 331 days ago
