PR #6302 IQ1_M: 1.75 bpw quantization

ikawrakow merged 24 commits into master from ik/iq1_m_new

iq1_m: basics

2a2d66de

iq1_m: basics-2

ac8b3dd2

iq1_m: CUDA dequantize works

1df37b65

iq1_m: separate shifts for each group of 8 in a block

282f2788

iq1_m: go to 3-bit scales

308c50d0

iq1_m: scalar dot product

64b9dfd7

iq1_m: AVX2 dot product

a139de51

iq1_m: very slightly faster AVX2 dot product

379fdb67

iq1_m: ARM_NEON dot product

8009b6d6

iq1_m: Metal - dequantize works, dot product does not

0e36afa0

iq1_m: Metal now works

19fb974d

iq1_m: minor

abc1d4f9

iq1_m: checking pure iq1_m quantization

dff85a80

iq1_m: slightly faster ARM_NEON dot product

f664692f

iq1_m: faster ARM_NEON dot product

b1d1c260

iq1_m: another minor ARM_NEON dot product improvement

78ce561a

iq1_m: small PPL improvement via super-block scale adjustment

3d9c21f6

iq1_m: adapt to CUDA refactoring

480d6d6c

ikawrakow force pushed to 480d6d6c 2 years ago

iq1_m: remove unused variable

62dd11f3

iq1_m: add to backend-ops tests

22fa1213

slaren commented on 2024-03-25

iq1_m: fix Windows ARM

b68f32b3

iq1_m: use common definition of iq1m_scale_t

9a5786e9

cuda: assert -> NO_DEVICE_CODE

cdb2d65c

ggerganov approved these changes on 2024-03-26

iq1_M: PR comments

6e4cef5d

ikawrakow merged 55c1b2a3 into master 2 years ago

ikawrakow deleted the ik/iq1_m_new branch 2 years ago

mofosyne added Tensor Encoding Scheme

mofosyne added Review Complexity : High

Reviewers: ggerganov, slaren

Assignees: none

Labels: Review Complexity : High, Tensor Encoding Scheme

Milestone: none

llama.cpp IQ1_M: 1.75 bpw quantization #6302 Merged
