cuda : improve text-generation and batched decoding performance #3776
cuda : prints wip (59d1232e)
cuda : new cublas gemm branch for multi-batch quantized src0 (52af7826)
cuda : add F32 sgemm branch (16b60dd7)
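The two commits above (52af7826, 16b60dd7) add a cuBLAS-based path for batched mat-muls whose src0 is quantized: the weights are dequantized once and the per-batch multiplications go through plain F32 SGEMM. Below is a minimal sketch of that idea, assuming a hypothetical `dequantize_to_f32()` helper and a simplified contiguous tensor layout; it is not the actual ggml-cuda.cu code.

```cpp
// Illustrative sketch only, not the actual ggml-cuda.cu implementation.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdint>

// hypothetical helper: expands a quantized matrix into a dense F32 buffer on the GPU
void dequantize_to_f32(const void * src0_q, float * src0_f32,
                       int64_t rows, int64_t cols, cudaStream_t stream);

void mul_mat_batched_cublas_f32(
        cublasHandle_t handle, cudaStream_t stream,
        const void  * src0_q,   // quantized weights, shared across the batch [ne00 x ne01]
        const float * src1,     // activations, one [ne00 x ne11] slice per batch element
        float       * dst,      // output, one [ne01 x ne11] slice per batch element
        int64_t ne00, int64_t ne01, int64_t ne11, int64_t n_batch,
        float * src0_f32 /* scratch buffer of ne00*ne01 floats */) {

    // dequantize src0 once; every batch element reuses the same dense weights
    dequantize_to_f32(src0_q, src0_f32, ne01, ne00, stream);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);

    for (int64_t b = 0; b < n_batch; ++b) {
        // column-major view: dst = src0^T * src1
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    (int) ne01, (int) ne11, (int) ne00,
                    &alpha,
                    src0_f32,                (int) ne00,
                    src1 + b*ne00*ne11,      (int) ne00,
                    &beta,
                    dst  + b*ne01*ne11,      (int) ne01);
    }
}
```

The trade-off being exploited: cuBLAS SGEMM scales much better with batch size than the quantized kernels, at the cost of an up-front dequantization of src0.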
cuda : fine-tune >= VOLTA params + use MMQ only for small batches (a3c28439)
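Commit a3c28439 restricts the quantized MMQ kernels to small batches on Volta-class and newer GPUs, falling back to the cuBLAS path otherwise. A hedged sketch of such a runtime heuristic follows; the threshold of 32 and the helper name are assumptions for illustration, not values taken from the PR.

```cpp
// Illustrative sketch (hypothetical names), not the actual ggml-cuda.cu logic.
#include <cstdint>

#define CC_SKETCH_VOLTA 700  // assumed: compute capability 7.0 and newer

static bool use_mmq_for_batch(int64_t batch_size, int compute_capability) {
    if (compute_capability < CC_SKETCH_VOLTA) {
        // pre-Volta GPUs have no tensor cores, so MMQ wins across the board
        return true;
    }
    // assumed threshold: MMQ stays faster only while few src1 rows are processed
    const int64_t mmq_small_batch = 32;
    return batch_size <= mmq_small_batch;
}
```

The point of the heuristic is that single-token text generation (batch size 1) keeps the fast quantized kernels, while large-batch decoding hands the work to tensor-core GEMM.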
cuda : remove duplicated cuBLAS GEMM code (4c6744b5)
cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros (a4e15a36)
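Commit a4e15a36 introduces the CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros. One way such macros could layer compile-time overrides on top of the runtime heuristic sketched above is shown below; the macro body, helper name, and the 128 threshold are illustrative assumptions, not the PR's exact code.

```cpp
// Illustrative sketch, not the actual ggml-cuda.cu preprocessor logic.
#if defined(GGML_CUDA_FORCE_MMQ)
// user explicitly forces the quantized MMQ kernels, regardless of batch size
#define MUL_MAT_USE_MMQ(batch, cc) (true)
#elif defined(CUDA_USE_TENSOR_CORES)
// tensor cores available: large batches go to cuBLAS, small ones keep MMQ
#define MUL_MAT_USE_MMQ(batch, cc) (use_mmq_for_batch((batch), (cc)))
#else
// no tensor cores: MMQ stays competitive up to larger batch sizes (assumed 128)
#define MUL_MAT_USE_MMQ(batch, cc) ((batch) <= 128)
#endif
```

Splitting the decision this way lets builds without tensor-core hardware, or users who want deterministic quantized kernels, opt out of the cuBLAS path without touching the runtime heuristic.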
ggerganov changed the title from "cuda : improve batched decoding performance for quantum models" to "cuda : improve text-generation and batched decoding performance for quantum models" 1 year ago
ggerganov changed the title from "cuda : improve text-generation and batched decoding performance for quantum models" to "cuda : improve text-generation and batched decoding performance" 1 year ago
build : add compile option to force use of MMQ kernels (49af767f)
ggerganov merged commit 2f9ec7e2 into master 1 year ago