llama.cpp
SOTA 3-bit quants
#5196
Merged

ikawrakow merged 14 commits into master from ik/iq3_xxs
iq3_xxs: quantize/dequantize
8524d277
iq3_xxs: CUDA dequantize works
bf9349c6
iq2_xxs: tuning quantization
90faca24
iq3_xxs: starting to look better
f1206729
iq3_xxs: CUDA dot product
f1875b0a
iq3_xxs: scalar and AVX2 dot products
c3b20296
iq3_xxs: ARM_NEON and Metal
15493023
iq3_xxs: slightly better grid points
51cde193
Faster iq3_xxs and iq2_xs dot products on CUDA
68cfcd47
iq3_xxs: add some quant mix
7e4e7488
iq3_xxs: fix failing quantization test
6efbc690
iq3_xxs: hopefully fix ROCm
62623434
iq3_xxs: failing tests
fe2160ee
ggerganov approved these changes on 2024-01-30
Add IQ3_XXS to test-backend-ops
fb6576bc
ikawrakow merged f4d7e549 into master 1 year ago
ikawrakow deleted the ik/iq3_xxs branch 1 year ago
mofosyne added Review Complexity : High
mofosyne added Tensor Encoding Scheme
