metal : PP speedup (#3084)

Commit

2 years ago

metal : PP speedup (#3084)

* Minor speed gains for all quantization types
* metal: faster kernel_scale via float4
* Various other speedups for "small" kernels
* metal: faster soft_max via float4
* metal: faster diagonal infinity
  Although, to me it looks like one should simply fuse scale + diagonal infinity + soft_max on the KQ tensor.
* Another faster f16 x f32 matrix multiply kernel
* Reverting the diag infinity change
  It does work for PP, but somehow it fails for TG. Need to look more into it.
* metal: add back faster diagonal infinity
  This time more carefully
* metal : minor (readability)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
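
The commit message suggests fusing scale + diagonal infinity + soft_max on the KQ tensor instead of running three separate kernels. The following is a minimal CPU-side sketch of what such a fusion might compute for a single row, not the Metal kernel from this commit. The function name fused_scale_mask_softmax_row and its parameters are hypothetical, and the masking rule (columns beyond n_past + row are treated as -inf) is an assumption about the usual causal-mask convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not part of llama.cpp): applies
//   1. scaling by `scale`,
//   2. the causal "diagonal infinity" mask, and
//   3. softmax
// to one row of a KQ score matrix without materializing intermediate rows.
//
// Assumption: column j is visible to query `row` iff j <= n_past + row.
// Masked columns behave as -inf before softmax, so they end up as 0.
void fused_scale_mask_softmax_row(std::vector<float>& kq_row,
                                  float scale, int n_past, int row) {
    assert(n_past >= 0 && row >= 0);
    if (kq_row.empty()) {
        return;
    }

    // Number of columns that survive the mask.
    const std::size_t n_visible =
        std::min(kq_row.size(), static_cast<std::size_t>(n_past) + row + 1);

    // Pass 1: maximum of the scaled visible scores (for numerical stability).
    float max_val = scale * kq_row[0];
    for (std::size_t j = 1; j < n_visible; ++j) {
        max_val = std::max(max_val, scale * kq_row[j]);
    }

    // Pass 2: exponentiate the visible scores and accumulate their sum.
    float sum = 0.0f;
    for (std::size_t j = 0; j < n_visible; ++j) {
        kq_row[j] = std::exp(scale * kq_row[j] - max_val);
        sum += kq_row[j];
    }

    // Pass 3: normalize visible entries; masked entries become exactly 0,
    // which is what exp(-inf) would have produced.
    const float inv_sum = 1.0f / sum;
    for (std::size_t j = 0; j < n_visible; ++j) {
        kq_row[j] *= inv_sum;
    }
    for (std::size_t j = n_visible; j < kq_row.size(); ++j) {
        kq_row[j] = 0.0f;
    }
}
```

Calling it on a row of four scores with scale 0.5, n_past = 1 and row = 0 leaves two visible columns that sum to 1 and sets the remaining two to 0. The potential benefit of fusing is that the row is read and written once rather than once per operation. Whether that pays off on the GPU depends on the kernel launch and memory behaviour, which this sketch does not model.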

References

#3084 - Metal: PP speedup

Author

ikawrakow

Parents

6eeb4d90

llama.cpp f31b6f4e - metal : PP speedup (#3084)