llama.cpp
IQ4_XS: a 4.25 bpw quantization
#5747
Merged
ikawrakow merged 11 commits into master from ik/iq4_nl_xs
67264b3b  Try IQ4_NL with blocks of 64 - does not look good
2b21d37a  iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
fddbfe83  iq4_xs: CUDA works - 133.2 t/s
061a16f5  iq4_xs: AVX2 dot product
a37980c3  iq4_xs: ARM_NEON dot product
ad40ae63  iq4_nl: Metal implementation
6c2b233b  iq3_xs: minor fix
5c2b2305  iq4_xs: shrink by using IQ3_S for attn_k and attn_q
f162fcaf  iq4_xs: revert using IQ3_S for attn_k and attn_v
801f998b  Fix CI
d7bb4b6d  iq4_xs: Added forgotten check for 256 divisibility
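For context, the 4.25 bpw figure in the title follows directly from the super-block layout described in the commits above: 256 weights per super-block, 4 bits per weight, a 6-bit scale for each block of 32, and one fp16 scale per super-block. Below is a minimal sketch in C of that layout, assuming the ggml block-struct convention; the field names are illustrative, not necessarily the ones in the merged code.

#include <stdint.h>

#define QK_K 256  /* super-block size used by the k-quants */

/* Sketch of one IQ4_XS super-block covering QK_K = 256 weights,
 * split into 8 blocks of 32, each with its own 6-bit scale. */
typedef struct {
    uint16_t d;                 /* fp16 super-block scale                  (16 bits)   */
    uint16_t scales_h;          /* high 2 bits of the eight 6-bit scales   (16 bits)   */
    uint8_t  scales_l[QK_K/64]; /* low 4 bits of the eight 6-bit scales    (32 bits)   */
    uint8_t  qs[QK_K/2];        /* 256 4-bit indices into a non-linear grid (1024 bits) */
} block_iq4_xs;

/* Bit accounting: (16 + 16 + 32 + 1024) / 256 = 1088 / 256 = 4.25 bpw,
 * matching the figure in the PR title. */

The 6-bit block scales account for the difference from plain 4-bit storage: 8 blocks x 6 bits plus the fp16 super-block scale add 64 bits, i.e. 0.25 bits per weight on top of the 4-bit quants.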
ggerganov approved these changes on 2024-02-27
ikawrakow merged 0becb22a into master 1 year ago
ikawrakow deleted the ik/iq4_nl_xs branch 1 year ago
mofosyne added the Review Complexity : High label
mofosyne added the Tensor Encoding Scheme label
Reviewers: ggerganov
Assignees: No one assigned
Labels: Review Complexity : High, Tensor Encoding Scheme
Milestone: No milestone