I must be missing something, but what exactly does "AWQ 3-bit + GGUF Q2_K" mean? What is the exact pipeline?
It means you first apply the scales and clipping from AWQ, computed with 3-bit calculations. The weights are kept in FP16. Then you quantize to the specified GGUF format.
Ok, I get it, so the only impact is here: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L48
which means you could use the exact bit width of the GGUF quant (e.g. 3.25 or whatever) for the AWQ scales/clip computation. It probably wouldn't make a difference, but I get it. Thanks.
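For anyone else trying to reproduce this, here is a rough sketch of the pipeline as I understand it from this PR. Treat it as a sketch, not the canonical recipe: the `export_compatible` flag is my reading of this integration, the paths and `w_bit` choice are illustrative, and the exact llama.cpp script names/flags depend on your checkout.

```python
# Sketch of the "AWQ scales + GGUF quant" pipeline (illustrative; names/flags may differ).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"
scaled_path = "mistral-7b-awq-scaled"   # FP16 model with AWQ scales/clipping applied

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# w_bit only drives the scale/clip search (pseudo-quantization); weights stay FP16.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 3, "version": "GEMM"}

# export_compatible=True (as I understand this PR): search and apply scales + clipping,
# but skip the actual INT packing, so the saved checkpoint is still FP16.
model.quantize(tokenizer, quant_config=quant_config, export_compatible=True)
model.save_quantized(scaled_path)

# Then hand the scaled FP16 checkpoint to llama.cpp (run from its repo root):
#   python convert.py <scaled_path> --outfile mistral-7b-awq-f16.gguf
#   ./quantize mistral-7b-awq-f16.gguf mistral-7b-awq-q2_k.gguf Q2_K
# "AWQ 3-bit + GGUF Q2_K" = w_bit=3 in the scale search above, Q2_K in the quantize step.
```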
So you could use AWQ to quantize in a way that fits exactly how GGUF works, for full compatibility. Am I understanding correctly?
This does not quantize the weights with AWQ; it only applies the AWQ scaling to the weights and keeps them in FP16. Since they are just FP16 weights, we can then apply GGUF quantization.
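A toy illustration of why the scaled model is still "just FP16 weights" (NumPy, made-up numbers; in the real flow the per-channel scale is folded into the preceding op rather than divided out of the input at runtime):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # activations
W = rng.normal(size=(16, 8))     # weight matrix (out_features x in_features)
s = 1.0 + rng.random(8)          # per-input-channel AWQ scales (found by the search)

y_ref    = x @ W.T               # original layer output
y_scaled = (x / s) @ (W * s).T   # scale folded into W, divided out of the input

assert np.allclose(y_ref, y_scaled)  # mathematically identical in full precision
# W * s is still an ordinary float tensor, so llama.cpp can apply any GGUF quant to it.
# The point of the scale search is that the channels that matter most for the output
# lose less precision when that later quantization step rounds W * s to low bits.
```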
AWQ 2-bit + GGUF Q2_K (6.7290 +/- 0.04008) has higher perplexity than GGUF Q2_K (6.1640 +/- 0.03474). Does that mean you "quantize twice", i.e. apply GGUF quantization onto the AWQ weights to further reduce size? It would also be interesting to show the file size of each method, to give us an idea about the combinations beyond perplexity.
We are not actually doing AWQ quantization. Like I mentioned earlier, we only scale the weights, which is different from quantizing them. The model's weights are adjusted according to the scales but not quantized; that is a separate process that we let llama.cpp run. This means the BPW and file size are the same as if you were to just use GGUF.
If AutoAWQ here is only used for applying scales, what's the benefit of using a lower-bit AWQ setting if the final file size depends solely on the GGUF quantization? Isn't it better to just use 8-bit AWQ for the sake of better scaling factors? Please help elaborate.
Isn't the benefit of AWQ limited in this case?
No, and here is why. Quantization is just about packing weights into INT4; nothing special happens during that process to minimize the impact of quantization. In other words, for the FP16 -> INT4 conversion to have the least quantization error, we must first compute optimal scales and apply them before converting to a quantized model.

- Scaling: We search for each weight's most optimal scaling factor. We do this with a loss function that uses pseudo-quantization to measure the difference between the FP16 and quantized outputs of every layer. After finding the most optimal scaling factor, we apply these scales (plus some weight clipping).
- Quantization: This part just converts the scaled weights from FP16 -> INT4. It is a practical step to make sure we can execute in a quantized format; nothing fancy happens here other than packing the weights in a specific format compatible with the implemented CUDA kernel.

The practical step of quantizing to a specific format is handled by llama.cpp, while we apply the AWQ scales beforehand. I would highly recommend reading the paper; the concepts are quite different from other methods.

The main benefit is a higher quality model, as can be observed from the perplexity numbers in most cases, except for Mixtral, which I am working on better quantization for.
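To make the "Scaling" step above concrete, here is a minimal NumPy sketch of a grid search over candidate scales using pseudo-quantization. This is not the actual AutoAWQ code: the real quantizer also searches clipping, works per module with calibration activations from a dataset, and folds the scale into the preceding op; the candidate grid and activation statistic here are simplified.

```python
import numpy as np

def pseudo_quantize(w: np.ndarray, n_bit: int, group_size: int = 128) -> np.ndarray:
    """Quantize-dequantize w in groups along the last axis (asymmetric, zero-point)."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)
    w_min, w_max = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    qmax = 2 ** n_bit - 1
    scale = np.maximum((w_max - w_min) / qmax, 1e-5)
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(orig_shape)

def search_awq_scales(W, X, n_bit=3, n_grid=20):
    """Pick per-input-channel scales s minimizing ||X W^T - (X/s)(pq(W*s))^T||^2."""
    x_mean = np.abs(X).mean(axis=0)            # activation magnitude per input channel
    y_ref = X @ W.T
    best_loss, best_s = np.inf, np.ones(W.shape[1])
    for ratio in np.linspace(0.0, 1.0, n_grid):
        s = np.clip(x_mean ** ratio, 1e-4, None)
        s = s / np.sqrt(s.max() * s.min())     # normalize the scale range
        Wq = pseudo_quantize(W * s, n_bit)     # weights are NOT kept quantized...
        loss = np.mean((y_ref - (X / s) @ Wq.T) ** 2)
        if loss < best_loss:
            best_loss, best_s = loss, s
    return best_s                              # ...only the best s is applied to FP16 W

# Toy usage: 3-bit scale search; the scaled float weights are then handed to llama.cpp.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
X = rng.normal(size=(64, 512)) * (1 + 5 * (rng.random(512) > 0.9))  # a few "salient" channels
W_scaled = W * search_awq_scales(W, X, n_bit=3)  # this is what gets GGUF-quantized later
```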
Thank you for the elaboration. I am curious: if AutoAWQ here is only used for calculating scales, what is the deal with mixing different bit widths between AWQ and GGUF? My understanding is that if the scales are calculated for 3-bit, for example, the GGUF quantization target should also be 3-bit to maintain consistency. Your experiment data, however, shows AWQ 4-bit + GGUF Q3_K_M > AWQ 3-bit + GGUF Q3_K_M. Is it because 3-bit AWQ is in general inaccurate/broken?
The reason is that Q3_K_M is a mixed-bit quantization in GGUF. That means the Q3_K_M format is not just INT3; it also has INT4 weights. We observe that INT4 is more effective for scaling in this case, likely because the quantization error becomes much larger when you apply INT3 scales to INT4 weights. That is likely why we see that AWQ 4-bit works better for the Q3_K_M format.
A future optimization in AutoAWQ could be the ability to do mixed-bit scaling. This could likely even improve AWQ quantization itself if applied thoughtfully, e.g. if some layers' losses are higher than others, you could adjust w_bit and retry to find a better scale.
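As a toy check of that argument (self-contained NumPy, purely illustrative; nothing here is the Q3_K_M code path): scales searched at 3-bit and then applied to a tensor that is ultimately stored at roughly 4-bit can reconstruct the layer output no better, and usually worse, than scales searched at 4-bit, because the 4-bit search directly minimizes the 4-bit reconstruction error.

```python
import numpy as np

def pseudo_quantize(w, n_bit, group_size=128):
    g = w.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    qmax = 2 ** n_bit - 1
    scale = np.maximum((hi - lo) / qmax, 1e-5)
    zero = np.round(-lo / scale)
    return ((np.clip(np.round(g / scale) + zero, 0, qmax) - zero) * scale).reshape(w.shape)

def store_mse_with_searched_scale(W, X, search_bit, store_bit, n_grid=20):
    """Search scales at `search_bit`, then report output MSE when stored at `store_bit`."""
    x_mean, y_ref = np.abs(X).mean(axis=0), X @ W.T
    best = (np.inf, np.ones(W.shape[1]))
    for ratio in np.linspace(0.0, 1.0, n_grid):
        s = np.clip(x_mean ** ratio, 1e-4, None)
        s = s / np.sqrt(s.max() * s.min())
        loss = np.mean((y_ref - (X / s) @ pseudo_quantize(W * s, search_bit).T) ** 2)
        best = min(best, (loss, s), key=lambda t: t[0])
    s = best[1]
    return np.mean((y_ref - (X / s) @ pseudo_quantize(W * s, store_bit).T) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
X = rng.normal(size=(64, 512)) * (1 + 5 * (rng.random(512) > 0.9))

# Mixed-bit formats like Q3_K_M also store some tensors at ~4 bit; for those tensors,
# scales searched at 4-bit should do at least as well as 3-bit scales on this objective.
print("3-bit search, stored at 4-bit:", store_mse_with_searched_scale(W, X, 3, 4))
print("4-bit search, stored at 4-bit:", store_mse_with_searched_scale(W, X, 4, 4))
```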
Some results from a Qwen 14B model:
Q3_K_M PPL = 9.6685 +/- 0.06744
Q4_K_M PPL = 9.5139 +/- 0.06592
Q5_K_M PPL = 9.4058 +/- 0.06490
Q2_K PPL = 10.8593 +/- 0.07482
Q8_0 PPL = 9.4008 +/- 0.06471
AWQ 4-bit + Q4_K_M PPL = 9.4109 +/- 0.06500
AWQ 6-bit + Q4_K_M PPL = 9.5216 +/- 0.06568
AWQ 6-bit + Q5_K_M PPL = 9.4202 +/- 0.06487
AWQ 4-bit + Q3_K_M PPL = 9.6123 +/- 0.06660
AWQ 4-bit + Q2_K PPL = 9.8321 +/- 0.06761
AWQ 3-bit + Q2_K PPL = 9.9867 +/- 0.06874
Looking forward to future mixed-bit scaling for further improvements.
A Q4_K_M Qwen 14B model:
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 20 tensors
llama_model_loader: - type q8_0: 20 tensors
llama_model_loader: - type q4_K: 121 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 1 tensors
@sorasoras Thanks for these numbers! They look particularly good to me. Great improvements, especially Q2, which is a large improvement. I just outlined the combinations below; they look good!
Q2: AWQ 4-bit + Q2_K = 9.8321 vs. plain Q2_K = 10.8593
Q3: AWQ 4-bit + Q3_K_M = 9.6123 vs. plain Q3_K_M = 9.6685
Q4: AWQ 4-bit + Q4_K_M = 9.4109 vs. plain Q4_K_M = 9.5139
Sidenote: have you looked into the SOTA 2-bit quants?
ggml-org/llama.cpp#4773 (comment)
This looks super interesting.
https://huggingface.co/ikawrakow/various-2bit-sota-gguf/tree/main
Perhaps AWQ could do optimizations for these new quants? I am not so sure though.
I checked their reference code for the new SOTA KNN/QuIP method. Many elements are similar to AWQ, but there are many unique aspects of this new method that are taken directly from QuIP#. You could certainly try to implement the unique aspects of QuIP# in AutoAWQ, like the importance matrix and the modifications for the E8 lattice search.
However, I don't think it is feasible for me to do these things alone, as AutoAWQ is already a large project to maintain mostly by myself. llama.cpp has a large community of open-source developers, so it is better suited to be implemented over there, since they also have a whole framework with specialized formats, CUDA kernels, and more that are constantly updated.
https://huggingface.co/ikawrakow/mistral-7b-quantized-gguf/blob/main/README.md has Mistral-7B quants in GGUF format where the perplexity seems lower throughout than what I see for AWQ in the table above. For convenience, here is a copy of the table you will find there:
Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants)
---|---|---|---|---|---
Q3_K_S | mistral-7b-q3ks.gguf | 6.0692 | 6.62% | 6.0021 | 5.44%
Q3_K_M | mistral-7b-q3km.gguf | 5.8894 | 3.46% | 5.8489 | 2.75%
Q4_K_S | mistral-7b-q4ks.gguf | 5.7764 | 1.48% | 5.7349 | 0.75%
Q4_K_M | mistral-7b-q4km.gguf | 5.7539 | 1.08% | 5.7259 | 0.59%
Q5_K_S | mistral-7b-q5ks.gguf | 5.7258 | 0.59% | 5.7100 | 0.31%
Q4_0 | mistral-7b-q40.gguf | 5.8189 | 2.23% | 5.7924 | 1.76%
Q4_1 | mistral-7b-q41.gguf | 5.8244 | 2.32% | 5.7455 | 0.94%
Q5_0 | mistral-7b-q50.gguf | 5.7180 | 0.45% | 5.7070 | 0.26%
Q5_1 | mistral-7b-q51.gguf | 5.7128 | 0.36% | 5.7057 | 0.24%
AWQ has only ever been able to run 4-bit quantization. However, with this integration, we can run any-bit quantization and export to llama.cpp for inference. This results in lower perplexity while ensuring compatibility with the GGUF ecosystem. The difference between GGUF and AWQ is most pronounced on the q_0 and q_1 models, but I mostly include perplexity numbers for the K methods from llama.cpp since they reach the lowest perplexity.
Perplexity
Perplexity measured with:
./perplexity -m <gguf_model> -f wikitext-2-raw/wiki.test.raw -ngl 33
Base Model: Mistral 7B (mistralai/Mistral-7B-v0.1)
FP16: 5.6934
Mixture of Experts Model: Mixtral 8x7B (mistralai/Mixtral-8x7B-v0.1)
Chat Model: Llama 2 7B Chat (TheBloke/Llama-2-7B-Chat-fp16)