llama.cpp
IQ1_M: 1.75 bpw quantization
#6302
Merged

ikawrakow merged 24 commits into master from ik/iq1_m_new
ikawrakow
iq1_m: basics
2a2d66de
iq1_m: basics-2
ac8b3dd2
iq1_m: CUDA dequantize works
1df37b65
iq1_m: separate shifts for each group of 8 in a block
282f2788
iq1_m: go to 3-bit scales
308c50d0
iq1_m: scalar dot product
64b9dfd7
iq1_m: AVX2 dot product
a139de51
iq1_m: very slightly faster AVX2 dot product
379fdb67
iq1_m: ARM_NEON dot product
8009b6d6
iq1_m: Metal - dequantize works, dot product does not
0e36afa0
iq1_m: Metal now works
19fb974d
iq1_m: minor
abc1d4f9
iq1_m: checking pure iq1_m quantization
dff85a80
iq1_m: slightly faster ARM_NEON dot product
f664692f
iq1_m: faster ARM_NEON dot product
b1d1c260
iq1_m: another minor ARM_NEON dot product improvement
78ce561a
iq1_m: small PPL improvement via super-block scale adjustment
3d9c21f6
iq1_m: adapt to CUDA refactoring
480d6d6c
ikawrakow force-pushed to 480d6d6c 1 year ago
iq1_m: remove unused variable
62dd11f3
iq1_m: add to backend-ops tests
22fa1213
Nexesenex
slaren commented on 2024-03-25
iq1_m: fix Windows ARM
b68f32b3
iq1_m: use common definition of iq1m_scale_t
9a5786e9
cuda: assert -> NO_DEVICE_CODE
cdb2d65c
ggerganov approved these changes on 2024-03-26
iq1_M: PR comments
6e4cef5d
ikawrakow merged 55c1b2a3 into master 1 year ago
ikawrakow deleted the ik/iq1_m_new branch 1 year ago
Nexesenex
ikawrakow
ikawrakow
Nexesenex
mofosyne added the Tensor Encoding Scheme label
mofosyne added the Review Complexity : High label
