llama.cpp
a128c38d - Fix ffn_down quantization mix for MoE models (#4927)

Fix ffn_down quantization mix for MoE models (#4927)

* Fix ffn_down quantization mix for MoE models

  In #4872 I did not consider the part where every third tensor is quantized with more bits. For MoE this leads to tensors of the same layer being quantized with different numbers of bits, which is not considered a possibility in the inference implementation (it is assumed that all experts use the same quantization).

* Fix the fix

* Review suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
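The core idea can be illustrated with a minimal sketch (not the actual llama.cpp quantization code): make the "use more bits" decision a function of the layer index only, rather than of a running tensor counter, so that every expert's ffn_down tensor in a MoE layer receives the same quantization type. The helper name `type_for_ffn_down_layer`, the specific layer pattern, and the two type labels below are illustrative assumptions, not the repository's API.

```cpp
// Sketch: per-layer (not per-tensor) quantization choice for ffn_down.
// All names and the exact layer pattern are hypothetical; the point is that
// the decision depends only on i_layer, so the n_expert ffn_down tensors of
// a MoE layer all end up with the same type.
#include <cstdio>

enum sketch_qtype { SKETCH_Q4_K, SKETCH_Q5_K };

// Hypothetical helper: give some layers (here, roughly every third one plus
// the first and last eighth of the network) more bits.
static sketch_qtype type_for_ffn_down_layer(int i_layer, int n_layer) {
    const bool more_bits =
        i_layer < n_layer/8 ||
        i_layer >= 7*n_layer/8 ||
        (i_layer - n_layer/8) % 3 == 2;
    return more_bits ? SKETCH_Q5_K : SKETCH_Q4_K;
}

int main() {
    const int n_layer  = 32;
    const int n_expert = 8; // MoE: several ffn_down tensors per layer

    for (int il = 0; il < n_layer; ++il) {
        const sketch_qtype t = type_for_ffn_down_layer(il, n_layer);
        for (int ie = 0; ie < n_expert; ++ie) {
            // Every expert of layer `il` shares the same type `t`,
            // which is what the inference code assumes.
            printf("ffn_down layer %2d expert %d -> %s\n",
                   il, ie, t == SKETCH_Q5_K ? "Q5_K" : "Q4_K");
        }
    }
    return 0;
}
```

A counter-based scheme ("every third tensor gets more bits") would instead split the experts of a single layer across two types once the counter crosses a boundary mid-layer, which is the mismatch the commit addresses.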