Add Q4_3 quantization (ARM NEON) #1082
ggerganov force pushed from 0408d1f8 to eed22aef 2 years ago
ggerganov force pushed from eed22aef to dff03c0d 2 years ago
ggml : add Q4_3 quantization (515ccfd2)
ggerganov force pushed from dff03c0d to 515ccfd2 2 years ago
ggerganov marked this pull request as ready for review 2 years ago
ggerganov merged e0305ead into master 2 years ago
ggerganov deleted the q4_3 branch 2 years ago
prusnak approved these changes on 2023-04-20
Initial `Q4_3` implementation runs at ~82 ms / token on M1. Need to see if we can optimize that somehow. For example, `Q4_1` runs at ~55 ms / token, so there is probably lots of room for improvement.

Merging this, although the speed is not satisfying. We have to try to get it as fast as `Q4_1`. We might have to change `block_q4_3` if needed to achieve this.