cuda : improve text-generation and batched decoding performance #3776
cuda : prints wip (59d1232e)
cuda : new cublas gemm branch for multi-batch quantized src0 (52af7826)
cuda : add F32 sgemm branch (16b60dd7)
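The two commits above (52af7826, 16b60dd7) add a cuBLAS-based path for batched mat-muls whose src0 is quantized: the weights are dequantized once and the per-batch multiplications go through plain F32 SGEMM. Below is a minimal sketch of that idea, assuming a hypothetical `dequantize_to_f32()` helper and a simplified contiguous tensor layout; it is not the actual ggml-cuda.cu code.

```cpp
// Illustrative sketch only, not the actual ggml-cuda.cu implementation.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdint>

// hypothetical helper: expands a quantized matrix into a dense F32 buffer on the GPU
void dequantize_to_f32(const void * src0_q, float * src0_f32,
                       int64_t rows, int64_t cols, cudaStream_t stream);

void mul_mat_batched_cublas_f32(
        cublasHandle_t handle, cudaStream_t stream,
        const void  * src0_q,   // quantized weights, shared across the batch [ne00 x ne01]
        const float * src1,     // activations, one [ne00 x ne11] slice per batch element
        float       * dst,      // output, one [ne01 x ne11] slice per batch element
        int64_t ne00, int64_t ne01, int64_t ne11, int64_t n_batch,
        float * src0_f32 /* scratch buffer of ne00*ne01 floats */) {

    // dequantize src0 once; every batch element reuses the same dense weights
    dequantize_to_f32(src0_q, src0_f32, ne01, ne00, stream);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);

    for (int64_t b = 0; b < n_batch; ++b) {
        // column-major view: dst = src0^T * src1
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    (int) ne01, (int) ne11, (int) ne00,
                    &alpha,
                    src0_f32,                (int) ne00,
                    src1 + b*ne00*ne11,      (int) ne00,
                    &beta,
                    dst  + b*ne01*ne11,      (int) ne01);
    }
}
```

The trade-off being exploited: cuBLAS SGEMM scales much better with batch size than the quantized kernels, at the cost of an up-front dequantization of src0.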
cuda : fine-tune >= VOLTA params + use MMQ only for small batches (a3c28439)
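Commit a3c28439 restricts the quantized MMQ kernels to small batches on Volta-class and newer GPUs, falling back to the cuBLAS path otherwise. A hedged sketch of such a runtime heuristic follows; the threshold of 32 and the helper name are assumptions for illustration, not values taken from the PR.

```cpp
// Illustrative sketch (hypothetical names), not the actual ggml-cuda.cu logic.
#include <cstdint>

#define CC_SKETCH_VOLTA 700  // assumed: compute capability 7.0 and newer

static bool use_mmq_for_batch(int64_t batch_size, int compute_capability) {
    if (compute_capability < CC_SKETCH_VOLTA) {
        // pre-Volta GPUs have no tensor cores, so MMQ wins across the board
        return true;
    }
    // assumed threshold: MMQ stays faster only while few src1 rows are processed
    const int64_t mmq_small_batch = 32;
    return batch_size <= mmq_small_batch;
}
```

The point of the heuristic is that single-token text generation (batch size 1) keeps the fast quantized kernels, while large-batch decoding hands the work to tensor-core GEMM.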
cuda : remove duplicated cuBLAS GEMM code (4c6744b5)
cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros (a4e15a36)
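Commit a4e15a36 introduces the CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros. One way such macros could layer compile-time overrides on top of the runtime heuristic sketched above is shown below; the macro body, helper name, and the 128 threshold are illustrative assumptions, not the PR's exact code.

```cpp
// Illustrative sketch, not the actual ggml-cuda.cu preprocessor logic.
#if defined(GGML_CUDA_FORCE_MMQ)
// user explicitly forces the quantized MMQ kernels, regardless of batch size
#define MUL_MAT_USE_MMQ(batch, cc) (true)
#elif defined(CUDA_USE_TENSOR_CORES)
// tensor cores available: large batches go to cuBLAS, small ones keep MMQ
#define MUL_MAT_USE_MMQ(batch, cc) (use_mmq_for_batch((batch), (cc)))
#else
// no tensor cores: MMQ stays competitive up to larger batch sizes (assumed 128)
#define MUL_MAT_USE_MMQ(batch, cc) ((batch) <= 128)
#endif
```

Splitting the decision this way lets builds without tensor-core hardware, or users who want deterministic quantized kernels, opt out of the cuBLAS path without touching the runtime heuristic.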
ggerganov changed the title from "cuda : improve batched decoding performance for quantum models" to "cuda : improve text-generation and batched decoding performance for quantum models" 1 year ago
ggerganov changed the title from "cuda : improve text-generation and batched decoding performance for quantum models" to "cuda : improve text-generation and batched decoding performance" 1 year ago
build : add compile option to force use of MMQ kernels (49af767f)
ggerganov merged commit 2f9ec7e2 into master 1 year ago