llama.cpp
cuda : improve text-generation and batched decoding performance #3776
Merged

ggerganov merged 7 commits into master from cuda-quantum-batch
ggerganov cuda : prints wip (59d1232e)
ggerganov cuda : new cublas gemm branch for multi-batch quantized src0 (52af7826)
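For large batches, the new branch dequantizes the quantized src0 to F16 and hands the multiplication to cuBLAS. Below is a minimal, self-contained sketch of that kind of tensor-core GEMM call (cublasGemmEx with F16 operands); it is not the ggml-cuda.cu code, and the matrix sizes and data are arbitrary.

```cpp
// Sketch only: F16 GEMM via cuBLAS, the kind of call used for multi-batch
// quantized src0 after dequantizing it to F16. Error checking is omitted.
// Build with: nvcc -o gemm_f16 gemm_f16.cu -lcublas
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int m = 64, n = 32, k = 128; // arbitrary sizes for the sketch

    std::vector<__half> hA(m * k), hB(k * n), hC(m * n);
    for (size_t i = 0; i < hA.size(); ++i) hA[i] = __float2half(0.01f * (i % 100));
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = __float2half(0.02f * (i % 50));

    __half *dA, *dB, *dC;
    cudaMalloc((void **) &dA, hA.size() * sizeof(__half));
    cudaMalloc((void **) &dB, hB.size() * sizeof(__half));
    cudaMalloc((void **) &dC, hC.size() * sizeof(__half));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(__half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // Column-major C = A * B; CUBLAS_GEMM_DEFAULT_TENSOR_OP lets cuBLAS pick
    // tensor-core kernels on Volta and newer GPUs.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,
                 dB, CUDA_R_16F, k,
                 &beta,
                 dC, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(__half), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", __half2float(hC[0]));

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```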
ggerganov cuda : add F32 sgemm branch (16b60dd7)
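The F32 branch is the fallback when F16 tensor-core math is not used: keep (or dequantize) the operands in single precision and call a plain SGEMM. A small sketch follows; the wrapper name sgemm_f32 is illustrative, only the cublasSgemm call itself is real cuBLAS API.

```cpp
// Sketch of the single-precision fallback path. The wrapper is hypothetical;
// cublasSgemm is the real cuBLAS entry point.
#include <cublas_v2.h>

// Computes C = A * B for column-major FP32 matrices already resident on the GPU.
// A is m x k, B is k x n, C is m x n.
void sgemm_f32(cublasHandle_t handle,
               const float * dA, const float * dB, float * dC,
               int m, int n, int k) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha,
                dA, m,
                dB, k,
                &beta,
                dC, m);
}
```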
ggerganov cuda : fine-tune >= VOLTA params + use MMQ only for small batches (a3c28439)
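The heuristic behind this commit is that the custom quantized MMQ kernels are fastest for small batches (e.g. single-token text generation), while dequantizing to F16/F32 and calling cuBLAS wins once the batch is large enough to keep the tensor cores busy. Here is a hedged sketch of that dispatch; the threshold value, enum, and function names are placeholders, and the real ggml-cuda.cu logic also considers tensor types and compile-time overrides.

```cpp
// Illustrative dispatch only; names and the threshold are placeholders,
// not the values tuned in this PR.
enum class matmul_path { MMQ, CUBLAS_F16, CUBLAS_F32 };

matmul_path choose_matmul_path(int batch_size /* number of src1 columns */,
                               bool src0_is_quantized,
                               bool use_tensor_cores,
                               int compute_capability) {
    const int small_batch_threshold = 32; // placeholder value

    if (src0_is_quantized && batch_size <= small_batch_threshold) {
        // Small batches (text generation): quantized MMQ kernels are fastest.
        return matmul_path::MMQ;
    }
    if (use_tensor_cores && compute_capability >= 70) { // >= Volta
        // Large batches: dequantize src0 to F16 and use a tensor-core GEMM.
        return matmul_path::CUBLAS_F16;
    }
    // Otherwise fall back to the F32 cuBLAS branch.
    return matmul_path::CUBLAS_F32;
}
```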
Review discussion: slaren, JohannesGaessler, ggerganov
ggerganov cuda : remove duplicated cuBLAS GEMM code (4c6744b5)
ggerganov cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros (a4e15a36)
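As I read this commit, the two macros are coupled: building with GGML_CUDA_FORCE_MMQ keeps every quantized matmul on the MMQ kernels, and otherwise CUDA_USE_TENSOR_CORES is defined so that large-batch multiplications take the F16 cuBLAS path. A minimal sketch of that relationship (not a verbatim excerpt from ggml-cuda.cu):

```cpp
// Define GGML_CUDA_FORCE_MMQ (e.g. via -DGGML_CUDA_FORCE_MMQ) to force the
// quantized MMQ kernels for all batch sizes:
//#define GGML_CUDA_FORCE_MMQ

#if !defined(GGML_CUDA_FORCE_MMQ)
// Otherwise use F16 cuBLAS GEMM (tensor cores on Volta and newer) for large batches.
#define CUDA_USE_TENSOR_CORES
#endif
```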
ggerganov changed the title from "cuda : improve batched decoding performance for quantum models" to "cuda : improve text-generation and batched decoding performance for quantum models" (1 year ago)
Discussion: ggerganov, oobabooga
ggerganov changed the title from "cuda : improve text-generation and batched decoding performance for quantum models" to "cuda : improve text-generation and batched decoding performance" (1 year ago)
Discussion: JohannesGaessler, Dampfinchen, oobabooga, Ph0rk0z, Tostino
ggerganov build : add compile option to force use of MMQ kernels (49af767f)
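This commit exposes the force-MMQ switch through the build system; to my recollection the option is named LLAMA_CUDA_FORCE_MMQ in the CMake/Makefile builds and simply defines GGML_CUDA_FORCE_MMQ for the compiler, but treat the option name as an assumption. A tiny compile-time check of the resulting macro:

```cpp
// Hypothetical check: build once with and once without -DGGML_CUDA_FORCE_MMQ
// to see which matrix-multiplication path was compiled in.
#include <cstdio>

int main() {
#if defined(GGML_CUDA_FORCE_MMQ)
    printf("MMQ kernels forced for all batch sizes\n");
#else
    printf("F16 cuBLAS (tensor core) path enabled for large batches\n");
#endif
    return 0;
}
```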
Discussion: JohannesGaessler, Ph0rk0z, ggerganov
ggerganov merged commit 2f9ec7e2 into master (1 year ago)
Post-merge discussion: Ph0rk0z, JohannesGaessler, ggerganov, LostRuins, Dampfinchen, cebtenzzre, slaren
