I must be missing something, but what exactly does "AWQ 3-bit + GGUF Q2_K" mean? What is the exact pipeline?
It means you first apply the scales and clipping from AWQ, computed with 3-bit calculations. The weights are kept in FP16. Then you quantize to the specified GGUF format.
Ok, I get it, so the only impact is here: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L48
which means you could use the exact bit width of the GGUF quant (e.g. 3.25 or whatever) for the AWQ scales/clip computation. It probably wouldn't make a difference, but I get it. Thanks.
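For anyone else trying to reproduce this, here is a rough sketch of the pipeline as I understand it from this PR. Treat it as a sketch, not the canonical recipe: the `export_compatible` flag is my reading of this integration, the paths and `w_bit` choice are illustrative, and the exact llama.cpp script names/flags depend on your checkout.

```python
# Sketch of the "AWQ scales + GGUF quant" pipeline (illustrative; names/flags may differ).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"
scaled_path = "mistral-7b-awq-scaled"   # FP16 model with AWQ scales/clipping applied

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# w_bit only drives the scale/clip search (pseudo-quantization); weights stay FP16.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 3, "version": "GEMM"}

# export_compatible=True (as I understand this PR): search and apply scales + clipping,
# but skip the actual INT packing, so the saved checkpoint is still FP16.
model.quantize(tokenizer, quant_config=quant_config, export_compatible=True)
model.save_quantized(scaled_path)

# Then hand the scaled FP16 checkpoint to llama.cpp (run from its repo root):
#   python convert.py <scaled_path> --outfile mistral-7b-awq-f16.gguf
#   ./quantize mistral-7b-awq-f16.gguf mistral-7b-awq-q2_k.gguf Q2_K
# "AWQ 3-bit + GGUF Q2_K" = w_bit=3 in the scale search above, Q2_K in the quantize step.
```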
So you could use AWQ to quantize in a way that fits exactly how GGUF works, for full compatibility. Am I understanding correctly?
This does not quantize the weights with AWQ; it only applies the AWQ scaling to the weights and keeps them in FP16. Since they are just FP16 weights, we can then apply GGUF quantization.
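A toy illustration of why the scaled model is still "just FP16 weights" (NumPy, made-up numbers; in the real flow the per-channel scale is folded into the preceding op rather than divided out of the input at runtime):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # activations
W = rng.normal(size=(16, 8))     # weight matrix (out_features x in_features)
s = 1.0 + rng.random(8)          # per-input-channel AWQ scales (found by the search)

y_ref    = x @ W.T               # original layer output
y_scaled = (x / s) @ (W * s).T   # scale folded into W, divided out of the input

assert np.allclose(y_ref, y_scaled)  # mathematically identical in full precision
# W * s is still an ordinary float tensor, so llama.cpp can apply any GGUF quant to it.
# The point of the scale search is that the channels that matter most for the output
# lose less precision when that later quantization step rounds W * s to low bits.
```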
AWQ 2-bit + GGUF Q2_K (6.7290 +/- 0.04008) has higher perplexity than GGUF Q2_K (6.1640 +/- 0.03474). Does that mean you "quantize twice", i.e. apply GGUF quantization onto the AWQ weights to further reduce size? It would also be interesting to show the file size of each method, to give us an idea about the combinations beyond perplexity.
We are not actually doing AWQ quantization. Like I mentioned earlier, we only scale the weights, which is different from quantizing them. The model's weights are adjusted according to the scales but not quantized; that is a separate process that we let llama.cpp run. This means the BPW and file size are the same as if you were to just use GGUF.
If AutoAWQ here is only used for applying scales, what's the benefit of using a lower-bit AWQ setting if the final file size depends solely on the GGUF quantization? Isn't it better to just use 8-bit AWQ for the sake of better scaling factors? Please help elaborate.
Isn't the benefit of AWQ limited in this case?
No, and here is why. Quantization is just about packing weights into INT4; nothing special happens during that process to minimize the impact of quantization. In other words, for the FP16 -> INT4 conversion to have the least quantization error, we must first compute optimal scales and apply them before converting to a quantized model.

- Scaling: We search for each weight's most optimal scaling factor. We do this with a loss function that uses pseudo-quantization to measure the difference between the FP16 and quantized outputs of every layer. After finding the most optimal scaling factor, we apply these scales (plus some weight clipping).
- Quantization: This part just converts the scaled weights from FP16 -> INT4. It is a practical step to make sure we can execute in a quantized format; nothing fancy happens here other than packing the weights in a specific format compatible with the implemented CUDA kernel.

The practical step of quantizing to a specific format is handled by llama.cpp, while we apply the AWQ scales beforehand. I would highly recommend reading the paper; the concepts are quite different from other methods.

The main benefit is a higher quality model, as can be observed from the perplexity numbers in most cases, except for Mixtral, which I am working on better quantization for.
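To make the "Scaling" step above concrete, here is a minimal NumPy sketch of a grid search over candidate scales using pseudo-quantization. This is not the actual AutoAWQ code: the real quantizer also searches clipping, works per module with calibration activations from a dataset, and folds the scale into the preceding op; the candidate grid and activation statistic here are simplified.

```python
import numpy as np

def pseudo_quantize(w: np.ndarray, n_bit: int, group_size: int = 128) -> np.ndarray:
    """Quantize-dequantize w in groups along the last axis (asymmetric, zero-point)."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)
    w_min, w_max = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    qmax = 2 ** n_bit - 1
    scale = np.maximum((w_max - w_min) / qmax, 1e-5)
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(orig_shape)

def search_awq_scales(W, X, n_bit=3, n_grid=20):
    """Pick per-input-channel scales s minimizing ||X W^T - (X/s)(pq(W*s))^T||^2."""
    x_mean = np.abs(X).mean(axis=0)            # activation magnitude per input channel
    y_ref = X @ W.T
    best_loss, best_s = np.inf, np.ones(W.shape[1])
    for ratio in np.linspace(0.0, 1.0, n_grid):
        s = np.clip(x_mean ** ratio, 1e-4, None)
        s = s / np.sqrt(s.max() * s.min())     # normalize the scale range
        Wq = pseudo_quantize(W * s, n_bit)     # weights are NOT kept quantized...
        loss = np.mean((y_ref - (X / s) @ Wq.T) ** 2)
        if loss < best_loss:
            best_loss, best_s = loss, s
    return best_s                              # ...only the best s is applied to FP16 W

# Toy usage: 3-bit scale search; the scaled float weights are then handed to llama.cpp.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
X = rng.normal(size=(64, 512)) * (1 + 5 * (rng.random(512) > 0.9))  # a few "salient" channels
W_scaled = W * search_awq_scales(W, X, n_bit=3)  # this is what gets GGUF-quantized later
```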
Thank you for the elaboration. I am curious: if AutoAWQ here is only used for calculating scales, what is the deal with mixing different bit widths between AWQ and GGUF? My understanding is that if the scales are calculated for 3-bit, for example, the GGUF quantization target should also be 3-bit to maintain consistency. Your experiment data, however, shows AWQ 4-bit + GGUF Q3_K_M > AWQ 3-bit + GGUF Q3_K_M. Is it because 3-bit AWQ is in general inaccurate/broken?
The reason is that Q3_K_M is a mixed-bit quantization in GGUF. That means the Q3_K_M format is not just INT3; it also has INT4 weights. We observe that INT4 is more effective for scaling in this case, likely because the quantization error becomes much larger when you apply INT3 scales to INT4 weights. That is likely why we see that AWQ 4-bit works better for the Q3_K_M format.
A future optimization in AutoAWQ could be the ability to do mixed-bit scaling. This could likely even improve AWQ quantization itself if applied thoughtfully, e.g. if some layers' losses are higher than others, you could adjust w_bit and retry to find a better scale.
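As a toy check of that argument (self-contained NumPy, purely illustrative; nothing here is the Q3_K_M code path): scales searched at 3-bit and then applied to a tensor that is ultimately stored at roughly 4-bit can reconstruct the layer output no better, and usually worse, than scales searched at 4-bit, because the 4-bit search directly minimizes the 4-bit reconstruction error.

```python
import numpy as np

def pseudo_quantize(w, n_bit, group_size=128):
    g = w.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    qmax = 2 ** n_bit - 1
    scale = np.maximum((hi - lo) / qmax, 1e-5)
    zero = np.round(-lo / scale)
    return ((np.clip(np.round(g / scale) + zero, 0, qmax) - zero) * scale).reshape(w.shape)

def store_mse_with_searched_scale(W, X, search_bit, store_bit, n_grid=20):
    """Search scales at `search_bit`, then report output MSE when stored at `store_bit`."""
    x_mean, y_ref = np.abs(X).mean(axis=0), X @ W.T
    best = (np.inf, np.ones(W.shape[1]))
    for ratio in np.linspace(0.0, 1.0, n_grid):
        s = np.clip(x_mean ** ratio, 1e-4, None)
        s = s / np.sqrt(s.max() * s.min())
        loss = np.mean((y_ref - (X / s) @ pseudo_quantize(W * s, search_bit).T) ** 2)
        best = min(best, (loss, s), key=lambda t: t[0])
    s = best[1]
    return np.mean((y_ref - (X / s) @ pseudo_quantize(W * s, store_bit).T) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
X = rng.normal(size=(64, 512)) * (1 + 5 * (rng.random(512) > 0.9))

# Mixed-bit formats like Q3_K_M also store some tensors at ~4 bit; for those tensors,
# scales searched at 4-bit should do at least as well as 3-bit scales on this objective.
print("3-bit search, stored at 4-bit:", store_mse_with_searched_scale(W, X, 3, 4))
print("4-bit search, stored at 4-bit:", store_mse_with_searched_scale(W, X, 4, 4))
```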
Some results from a Qwen 14B model:
Q3_K_M PPL = 9.6685 +/- 0.06744
Q4_K_M PPL = 9.5139 +/- 0.06592
Q5_K_M PPL = 9.4058 +/- 0.06490
Q2_K PPL = 10.8593 +/- 0.07482
Q8_0 PPL = 9.4008 +/- 0.06471
AWQ 4-bit + Q4_K_M PPL = 9.4109 +/- 0.06500
AWQ 6-bit + Q4_K_M PPL = 9.5216 +/- 0.06568
AWQ 6-bit + Q5_K_M PPL = 9.4202 +/- 0.06487
AWQ 4-bit + Q3_K_M PPL = 9.6123 +/- 0.06660
AWQ 4-bit + Q2_K PPL = 9.8321 +/- 0.06761
AWQ 3-bit + Q2_K PPL = 9.9867 +/- 0.06874
Looking forward to future mixed-bit scaling for further improvements.
A Q4_K_M Qwen 14B model:
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 20 tensors
llama_model_loader: - type q8_0: 20 tensors
llama_model_loader: - type q4_K: 121 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 1 tensors
@sorasoras Thanks for these numbers! They look particularly good to me. Great improvements, especially Q2, which is a large improvement. I just outlined the combinations below; they look good!
Q2: AWQ 4-bit + Q2_K = 9.8321 vs. plain Q2_K = 10.8593
Q3: AWQ 4-bit + Q3_K_M = 9.6123 vs. plain Q3_K_M = 9.6685
Q4: AWQ 4-bit + Q4_K_M = 9.4109 vs. plain Q4_K_M = 9.5139
Sidenote: have you looked into the SOTA 2-bit quants?
ggml-org/llama.cpp#4773 (comment)
This looks super interesting.
https://huggingface.co/ikawrakow/various-2bit-sota-gguf/tree/main
Perhaps AWQ could do optimizations for these new quants? I am not so sure though.
I checked their reference code for the new SOTA KNN/QuIP method. Many elements are similar to AWQ, but there are many unique aspects of this new method that are taken directly from QuIP#. You could certainly try to implement the unique aspects of QuIP# in AutoAWQ, like the importance matrix and the modifications for the E8 lattice search.
However, I don't think it is feasible for me to do these things alone, as AutoAWQ is already a large project to maintain mostly by myself. llama.cpp has a large community of open-source developers, so it is better suited to be implemented over there, since they also have a whole framework with specialized formats, CUDA kernels, and more that are constantly updated.
https://huggingface.co/ikawrakow/mistral-7b-quantized-gguf/blob/main/README.md has Mistral-7B quants in GGUF format where the perplexity seems lower throughout than what I see for AWQ in the table above. For convenience, here is a copy of the table you will find there:
Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants)
---|---|---|---|---|---
Q3_K_S | mistral-7b-q3ks.gguf | 6.0692 | 6.62% | 6.0021 | 5.44%
Q3_K_M | mistral-7b-q3km.gguf | 5.8894 | 3.46% | 5.8489 | 2.75%
Q4_K_S | mistral-7b-q4ks.gguf | 5.7764 | 1.48% | 5.7349 | 0.75%
Q4_K_M | mistral-7b-q4km.gguf | 5.7539 | 1.08% | 5.7259 | 0.59%
Q5_K_S | mistral-7b-q5ks.gguf | 5.7258 | 0.59% | 5.7100 | 0.31%
Q4_0 | mistral-7b-q40.gguf | 5.8189 | 2.23% | 5.7924 | 1.76%
Q4_1 | mistral-7b-q41.gguf | 5.8244 | 2.32% | 5.7455 | 0.94%
Q5_0 | mistral-7b-q50.gguf | 5.7180 | 0.45% | 5.7070 | 0.26%
Q5_1 | mistral-7b-q51.gguf | 5.7128 | 0.36% | 5.7057 | 0.24%
AWQ has only ever been able to run 4-bit quantization. However, with this integration, we can run any-bit quantization and export to llama.cpp for inference. This results in lower perplexity while ensuring compatibility with the GGUF ecosystem. The difference between GGUF and AWQ is most pronounced on the q_0 and q_1 models, but I mostly include perplexity numbers for the K methods from llama.cpp since they reach the lowest perplexity.
Perplexity
Perplexity measured with:
./perplexity -m <gguf_model> -f wikitext-2-raw/wiki.test.raw -ngl 33
Base Model: Mistral 7B (mistralai/Mistral-7B-v0.1)
FP16: 5.6934
Mixture of Experts Model: Mixtral 8x7B (mistralai/Mixtral-8x7B-v0.1)
Chat Model: Llama 2 7B Chat (TheBloke/Llama-2-7B-Chat-fp16)