llama.cpp
Add Q4_3 quantization (ARM NEON)
#1082
Merged

ggerganov merged 1 commit into master from q4_3
ggerganov commented 2 years ago (edited 2 years ago)

Initial Q4_3 implementation runs at ~82 ms / token on M1.
Need to see if we can optimize that somehow.

For example, Q4_1 runs at ~55 ms / token, so there is probably a lot of room for improvement.

#define QK4_3 16
typedef struct {
    ggml_fp16_t d;         // delta
    ggml_fp16_t m;         // min
    uint8_t qs[QK4_3 / 2]; // nibbles / quants
} block_q4_3;

Merging this, although the speed is not satisfactory. We have to try to get it as fast as Q4_1, and we might have to change block_q4_3 if needed to achieve this.

ggerganov force-pushed from 0408d1f8 to eed22aef 2 years ago
ggerganov force-pushed from eed22aef to dff03c0d 2 years ago
ggerganov added commit 515ccfd2: ggml : add Q4_3 quantization
ggerganov force-pushed from dff03c0d to 515ccfd2 2 years ago
ggerganov marked this pull request as ready for review 2 years ago
ggerganov merged e0305ead into master 2 years ago
ggerganov deleted the q4_3 branch 2 years ago
prusnak approved these changes on 2023-04-20

prusnak commented 2 years ago

M1 16 GB benchmark:

7B q4_3 4 threads: 180 ms/token
7B q4_3 8 threads: 280 ms/token
