llama.cpp
49662cbe - ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)

Commit

2 years ago

ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)

* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dot product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

References

#4856 - SOTA 2-bit quants - part 2

Author

ikawrakow

Parents

3ba5b8ca
