llama.cpp
cuda : improve text-generation and batched decoding performance
#3776
Merged

Commits
  • cuda : prints wip
    ggerganov committed 2 years ago
  • cuda : new cublas gemm branch for multi-batch quantized src0
    ggerganov committed 2 years ago
  • cuda : add F32 sgemm branch
    ggerganov committed 2 years ago
  • cuda : fine-tune >= VOLTA params + use MMQ only for small batches
    ggerganov committed 2 years ago
  • cuda : remove duplicated cuBLAS GEMM code
    ggerganov committed 2 years ago
  • cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
    ggerganov committed 2 years ago
  • build : add compile option to force use of MMQ kernels
    ggerganov committed 2 years ago