llama.cpp
Better 1.5 bit quantization
#5971
Merged
ikawrakow merged 15 commits into master from ik/iq1s_blocks16
ikawrakow added the "breaking change" label
ikawrakow force pushed 2 years ago
ggerganov approved these changes on 2024-03-10
Trying blocks of 16 for IQ1_S - seems slightly better
c9e9acf2
iq1s_blocks16: Adjust scale fudge factor to 1.125
cd83a7d3
iq1s_blocks16: going to blocks of 32
4c4404ac
iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment
c55e66f9
iq1s_blocks16: scalar and AVX2 dot products
864a5c2c
iq1s_blocks16: CUDA dot product
f092d049
iq1s_blocks16: Metal works, Neon does not
fbb001e6
iq1s_blocks16: fixed Neon
15acc792
iq1s_blocks16: very slightly faster TG on Metal
8561139a
iq1s_blocks16: speedup Metal by packing codebook into uint32_t's
d3da9d16
Formatting
7545d693
iq1s_blocks16: uint32_t codebook is also better in CUDA
156220f8
iq1s_blocks16: slightly faster Neon dot product
101b18d5
iq1s_blocks16: faster AVX2 dot product
34bc21ff
iq1s_blocks16: adjust to ggml-common.h
9d831712
ikawrakow force pushed to 9d831712 2 years ago
ikawrakow merged be858f62 into master 2 years ago
ikawrakow deleted the ik/iq1s_blocks16 branch 2 years ago
Reviewers: ggerganov
Assignees: No one assigned
Labels: breaking change
Milestone: No milestone