llama.cpp
IQ4_XS: a 4.25 bpw quantization
#5747
Merged

ikawrakow merged 11 commits into master from ik/iq4_nl_xs
67264b3b Try IQ4_NL with blocks of 64 - does not look good
2b21d37a iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32 [layout sketch after the commit list]
fddbfe83 iq4_xs: CUDA works - 133.2 t/s
061a16f5 iq4_xs: AVX2 dot product
a37980c3 iq4_xs: ARM_NEON dot product [scalar reference sketch after the commit list]
ad40ae63 iq4_nl: Metal implementation
6c2b233b iq3_xs: minor fix
5c2b2305 iq4_xs: shrink by using IQ3_S for attn_k and attn_q
f162fcaf iq4_xs: revert using IQ3_S for attn_k and attn_v [selection sketch after the commit list]
801f998b Fix CI
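
The 4.25 bpw in the title falls directly out of the format introduced in 2b21d37a. Below is a minimal sketch of the super-block layout, modeled on ggml's block_iq4_xs struct (field names and sizes as I understand the merged ggml-common.h, so treat the details as illustrative rather than authoritative):

```c
#include <stdint.h>
#include <assert.h>

#define QK_K 256  // weights per super-block

typedef struct {
    uint16_t d;                 // super-block scale, stored as fp16
    uint16_t scales_h;          // 2 high bits of each 6-bit sub-block scale (8 x 2 bits)
    uint8_t  scales_l[QK_K/64]; // 4 low bits of each 6-bit sub-block scale, two per byte
    uint8_t  qs[QK_K/2];        // 4-bit codebook indices, two weights per byte
} block_iq4_xs;

// 16 (d) + 16 (scales_h) + 32 (scales_l) + 1024 (qs) = 1088 bits per 256 weights,
// i.e. 1088 / 256 = 4.25 bits per weight.
static_assert(sizeof(block_iq4_xs) == 136, "136 bytes * 8 / 256 weights = 4.25 bpw");
```

Replacing IQ4_NL's independent blocks of 32, each with its own fp16 scale (18 bytes per 32 weights, 4.5 bpw), with one fp16 scale plus 6-bit sub-scales per 256 weights is what trims the size to 4.25 bpw.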
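The AVX2 (061a16f5) and ARM_NEON (a37980c3) kernels compute the same arithmetic as the scalar reference below. This is a hedged sketch of the dequantize-and-accumulate logic, reusing QK_K and block_iq4_xs from the sketch above; the kvalues_iq4nl codebook and scale unpacking follow ggml's reference dequantizer as I recall it, while fp16_to_fp32 is a simplified stand-in (no zero/denormal/NaN handling) and the function name is illustrative:

```c
#include <stdint.h>
#include <string.h>

// Non-linear 4-bit codebook shared with IQ4_NL (kvalues_iq4nl in ggml).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 2, 13, 25, 38, 53, 69, 89, 113,
};

// Simplified fp16 -> fp32 conversion, for this sketch only.
static float fp16_to_fp32(uint16_t h) {
    uint32_t bits = ((uint32_t)(h >> 15) << 31)           // sign
                  | ((((h >> 10) & 0x1fu) + 112u) << 23)  // rebiased exponent
                  | ((uint32_t)(h & 0x3ffu) << 13);       // mantissa
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Scalar dot product of an IQ4_XS row x with an fp32 vector y;
// n is the row length and must be a multiple of QK_K.
static float vec_dot_iq4_xs_f32(int n, const block_iq4_xs *x, const float *y) {
    float sum = 0.0f;
    for (int i = 0; i < n/QK_K; ++i) {
        const float d = fp16_to_fp32(x[i].d);
        for (int ib = 0; ib < QK_K/32; ++ib) {  // 8 sub-blocks of 32 weights
            // Reassemble the 6-bit sub-block scale from its low and high bits.
            const int ls = ((x[i].scales_l[ib/2] >> 4*(ib%2)) & 0xf)
                         | (((x[i].scales_h >> 2*ib) & 3) << 4);
            const float dl = d * (ls - 32);     // scales carry a +32 offset
            const uint8_t *qs = x[i].qs + 16*ib;
            const float   *yb = y + QK_K*i + 32*ib;
            for (int j = 0; j < 16; ++j) {      // low nibbles: weights 0..15, high: 16..31
                sum += dl * kvalues_iq4nl[qs[j] & 0xf] * yb[j];
                sum += dl * kvalues_iq4nl[qs[j] >>  4] * yb[j + 16];
            }
        }
    }
    return sum;
}
```

The SIMD kernels vectorize the inner nibble loop; the codebook lookup is what distinguishes the IQ4 family from plain linear 4-bit quantization.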
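Commits 5c2b2305 and f162fcaf amount to a per-tensor type override experiment: shrink the file by quantizing the attention K/Q projections with the smaller IQ3_S type, then undo it. A hypothetical sketch of that kind of selection (the enum, function name, and substring matching are illustrative, not llama.cpp's actual quantization dispatch):

```c
#include <string.h>

typedef enum { TYPE_IQ3_S, TYPE_IQ4_XS } qtype;

// Illustrative per-tensor choice: 5c2b2305 dropped the attention K/Q
// projections to IQ3_S; f162fcaf reverted them to plain IQ4_XS.
static qtype choose_type(const char *tensor_name) {
    if (strstr(tensor_name, "attn_k") || strstr(tensor_name, "attn_q")) {
        return TYPE_IQ3_S;
    }
    return TYPE_IQ4_XS;
}
```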
d7bb4b6d iq4_xs: Added forgotten check for 256 divisibility
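
The super-block layout only works when a row splits into whole 256-weight blocks, hence this late check. A sketch of the guard's shape (the function name is hypothetical; in llama.cpp, tensors whose row size is not a multiple of 256 get a fallback type, e.g. the block-of-32 IQ4_NL):

```c
#include <stdint.h>

// A row can be stored as IQ4_XS only if it divides into whole
// super-blocks of 256 weights; otherwise a fallback type is needed.
static int iq4_xs_compatible(int64_t n_per_row) {
    return n_per_row % 256 == 0;
}
```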
ggerganov approved these changes on 2024-02-27
ikawrakow merged 0becb22a into master 1 year ago
ikawrakow deleted the ik/iq4_nl_xs branch 1 year ago
mofosyne added the Review Complexity : High and Tensor Encoding Scheme labels
