llama.cpp
13731766 - llamafile : ppc64le GEMV forwarding for FP32. (#12594)

Commit

1 year ago

llamafile : ppc64le GEMV forwarding for FP32. (#12594)

This patch enables the use of MMA when one of the matrix dimensions (i.e. either M or N) is 1. This is useful for token generation, where N < 2.

The concept of 'GEMV forwarding' is used: when one of the matrices has a single row or column, its elements are broadcast instead of being prepacked by the packing routine.

This change yields a 5% - 15% improvement in total speed (i.e. all tokens / total time) across various batch sizes, compared with the corresponding dot product implementation.

The patch was tested with FP32 models of Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf on an IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
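To illustrate the broadcasting idea described in the commit message, here is a minimal, portable sketch. It is not the code from the patch and uses no POWER10 MMA intrinsics. The function names `gemv_broadcast` and `gemv_dot`, the row-major layout, and the tile width of 4 are all assumptions made for this example. It only shows the general shape of the technique: when the second operand is a single vector, each of its elements can be replicated across the lanes of a tile on the fly, so no packed copy of that operand is needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from llama.cpp): computes y = A * x, where A is an
// m x k matrix stored row-major. Rows are handled in tiles of 4 (an assumed
// width). For each column l, the single element x[l] is broadcast across the
// tile's lanes instead of being copied into a prepacked buffer.
std::vector<float> gemv_broadcast(const std::vector<float>& A,
                                  const std::vector<float>& x,
                                  std::size_t m, std::size_t k) {
    constexpr std::size_t TILE = 4;
    std::vector<float> y(m, 0.0f);
    for (std::size_t i0 = 0; i0 < m; i0 += TILE) {
        // The last tile may be partial when m is not a multiple of TILE.
        const std::size_t rows = (m - i0 < TILE) ? (m - i0) : TILE;
        float acc[TILE] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (std::size_t l = 0; l < k; ++l) {
            // Broadcast: every lane receives the same element of x.
            const float xb[TILE] = {x[l], x[l], x[l], x[l]};
            for (std::size_t r = 0; r < rows; ++r) {
                acc[r] += A[(i0 + r) * k + l] * xb[r];
            }
        }
        for (std::size_t r = 0; r < rows; ++r) {
            y[i0 + r] = acc[r];
        }
    }
    return y;
}

// Reference: one plain dot product per output row, for comparison.
std::vector<float> gemv_dot(const std::vector<float>& A,
                            const std::vector<float>& x,
                            std::size_t m, std::size_t k) {
    std::vector<float> y(m, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        float sum = 0.0f;
        for (std::size_t l = 0; l < k; ++l) {
            sum += A[i * k + l] * x[l];
        }
        y[i] = sum;
    }
    return y;
}
```

Both functions produce the same result for a given input. In a hardware implementation the broadcast would presumably be a single vector splat feeding the matrix unit, which is where the saving over a separate packing pass would come from. The scalar loops here only mirror the data flow and say nothing about actual performance.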

References

#12594 - llamafile : ppc64le GEMV forwarding for FP32.

Author

amritahs-ibm

Parents

ab6ab8f8
