llama.cpp
2b4ea35e - cuda : add batched cuBLAS GEMM for faster attention (#3749)

Commit

2 years ago

cuda : add batched cuBLAS GEMM for faster attention (#3749)

* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review

  These changes plus:

  ```c++
  #define cublasGemmBatchedEx hipblasGemmBatchedEx
  ```

  are needed to compile with ROCM. I haven't done performance testing, but it seems to work.

  I couldn't figure out how to propose a change for lines outside what the pull changed, also this is the first time trying to create a multi-part review so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
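The message names two batched GEMM entry points, `cublasGemmStridedBatchedEx` for non-broadcasted cases and `cublasGemmBatchedEx` otherwise. The sketch below is a plain CPU reference of the two calling conventions, not the code from this commit. It assumes row-major float matrices for readability (cuBLAS itself is column-major and this path uses F16), and every function name in it (`gemm_ref`, `gemm_strided_batched_ref`, `gemm_batched_ref`, `broadcast_ptrs`) is hypothetical. It illustrates why a pointer-array interface is presumably the one that fits a broadcast, where several batch entries reuse the same src0 matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major C[m x n] = A[m x k] * B[k x n] for a single matrix pair.
static void gemm_ref(const float* A, const float* B, float* C,
                     int m, int n, int k) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

// Strided convention: matrix b of each operand lives at base + b * stride.
// This only works when every batch entry has its own, evenly spaced matrix.
void gemm_strided_batched_ref(const float* A, std::size_t strideA,
                              const float* B, std::size_t strideB,
                              float* C, std::size_t strideC,
                              int m, int n, int k, int batch) {
    for (int b = 0; b < batch; ++b) {
        gemm_ref(A + b * strideA, B + b * strideB, C + b * strideC, m, n, k);
    }
}

// Pointer-array convention: one explicit pointer per batch entry, so
// entries may alias the same matrix (a broadcast).
void gemm_batched_ref(const std::vector<const float*>& A,
                      const std::vector<const float*>& B,
                      const std::vector<float*>& C,
                      int m, int n, int k) {
    assert(A.size() == B.size() && B.size() == C.size());
    for (std::size_t b = 0; b < A.size(); ++b) {
        gemm_ref(A[b], B[b], C[b], m, n, k);
    }
}

// Hypothetical helper: build a pointer array in which each source matrix
// is reused for `ratio` consecutive batch entries.
std::vector<const float*> broadcast_ptrs(const float* base, std::size_t stride,
                                         int batch, int ratio) {
    std::vector<const float*> out(batch);
    for (int b = 0; b < batch; ++b) {
        out[b] = base + static_cast<std::size_t>(b / ratio) * stride;
    }
    return out;
}
```

In this sketch the strided form needs no per-batch pointer array, while the pointer-array form pays for building that array and in return supports aliasing. That trade-off is consistent with the "reduce mallocs in cublasGemmBatchedEx branch" item above, though the commit message itself does not spell out the reasoning.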

References

#3749 - cuda : add batched cuBLAS GEMM for faster attention

Author

ggerganov

Parents

daab3d7f
