cuda : improve text-generation and batched decoding performance (#3776)
* cuda : add debug prints (WIP)
* cuda : add new cuBLAS GEMM branch for multi-batch quantized src0
* cuda : add F32 SGEMM branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches (see dispatch sketch below)
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros (see gating sketch below)
* build : add compile option to force use of MMQ kernels
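
The core idea of the change is how matrix multiplications with a quantized src0 are routed: the custom quantized MMQ kernels stay on the fast path for single-token text generation and small batches, while larger batches during batched decoding go through cuBLAS GEMM, which can use tensor cores on >= Volta GPUs. The following is a minimal sketch of that dispatch only; the function name, the threshold value, and the struct-free signature are placeholders for illustration, not the actual ggml-cuda symbols.

```cpp
// Illustrative sketch of the MMQ vs. cuBLAS dispatch idea (not the real ggml-cuda code).
#include <cstdint>

enum class matmul_path { MMQ, CUBLAS_GEMM };

constexpr int     CC_VOLTA      = 700; // compute capability 7.0 (Volta)
constexpr int64_t MMQ_MAX_BATCH = 32;  // placeholder threshold, tuned per GPU in practice

// src0 may be quantized, src1/dst are float; n_batch = number of tokens decoded at once.
matmul_path choose_matmul_path(bool src0_quantized, int64_t n_batch, int compute_capability) {
    if (!src0_quantized) {
        // Plain F32/F16 inputs go straight to the cuBLAS SGEMM/GEMM branch.
        return matmul_path::CUBLAS_GEMM;
    }
    if (compute_capability >= CC_VOLTA && n_batch > MMQ_MAX_BATCH) {
        // Batched decoding: dequantize src0 and let cuBLAS GEMM use tensor cores.
        return matmul_path::CUBLAS_GEMM;
    }
    // Text generation and small batches: custom quantized MMQ kernels win.
    return matmul_path::MMQ;
}
```

In the real code the decision also depends on compile-time configuration, outlined next.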
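
The two macros from the commit above control that choice at compile time. A minimal sketch of how they can relate, assumed rather than copied from ggml-cuda.cu:

```cpp
// Sketch of the compile-time gating (assumed layout, not the verbatim source):
// the build option from the last commit defines GGML_CUDA_FORCE_MMQ, which keeps
// CUDA_USE_TENSOR_CORES undefined so the quantized MMQ kernels are always used.
#if !defined(GGML_CUDA_FORCE_MMQ)
#define CUDA_USE_TENSOR_CORES
#endif

#ifdef CUDA_USE_TENSOR_CORES
// Default build: large-batch quantized matmuls may be routed to cuBLAS GEMM on >= Volta.
#else
// GGML_CUDA_FORCE_MMQ set at build time: always use the MMQ kernels.
#endif
```

The build option in the last commit exists so users can force the MMQ path when the cuBLAS/tensor-core path is not a win on their hardware.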