PR #5971 Better 1.5 bit quantization

ikawrakow merged 15 commits into master from ik/iq1s_blocks16

ikawrakow added the breaking change label

ikawrakow force pushed 2 years ago

ggerganov approved these changes on 2024-03-10

Trying blocks of 16 for IQ1_S - seems slightly better

c9e9acf2

iq1s_blocks16: Adjust scale fudge factor to 1.125

cd83a7d3

iq1s_blocks16: going to blocks of 32

4c4404ac

iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

c55e66f9

iq1s_blocks16: scalar and AVX2 dot products

864a5c2c

iq1s_blocks16: CUDA dot product

f092d049

iq1s_blocks16: Metal works, Neon does not

fbb001e6

iq1s_blocks16: fixed Neon

15acc792

iq1s_blocks16: very slightly faster TG on Metal

8561139a

iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

d3da9d16

Formatting

7545d693

iq1s_blocks16: uint32_t codebook is also better in CUDA

156220f8

iq1s_blocks16: slightly faster Neon dot product

101b18d5

iq1s_blocks16: faster AVX2 dot product

34bc21ff

iq1s_blocks16: adjust to ggml-common.h

9d831712

ikawrakow force pushed to 9d831712 2 years ago

ikawrakow merged be858f62 into master 2 years ago

ikawrakow deleted the ik/iq1s_blocks16 branch 2 years ago

Reviewers

ggerganov

Labels

breaking change
