llama.cpp
Better 1.5 bit quantization
#5971
Merged

ikawrakow merged 15 commits into master from ik/iq1s_blocks16
ikawrakow added the breaking change label
ikawrakow force pushed 2 years ago
ggerganov approved these changes on 2024-03-10
Trying blocks of 16 for IQ1_S - seems slightly better
c9e9acf2
iq1s_blocks16: Adjust scale fudge factor to 1.125
cd83a7d3
iq1s_blocks16: going to blocks of 32
4c4404ac
iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment
c55e66f9
iq1s_blocks16: scalar and AVX2 dot products
864a5c2c
iq1s_blocks16: CUDA dot product
f092d049
iq1s_blocks16: Metal works, Neon does not
fbb001e6
iq1s_blocks16: fixed Neon
15acc792
iq1s_blocks16: very slightly faster TG on Metal
8561139a
iq1s_blocks16: speedup Metal by packing codebook into uint32_t's
d3da9d16
Formatting
7545d693
iq1s_blocks16: uint32_t codebook is also better in CUDA
156220f8
iq1s_blocks16: slightly faster Neon dot product
101b18d5
iq1s_blocks16: faster AVX2 dot product
34bc21ff
iq1s_blocks16: adjust to ggml-common.h
9d831712
ikawrakow force pushed to 9d831712 2 years ago
ikawrakow merged be858f62 into master 2 years ago
ikawrakow deleted the ik/iq1s_blocks16 branch 2 years ago