llama.cpp
02810c7a - Fix and restrict NVFP4 edge-cases in llama-graph (#24331)

Commit

3 days ago

Fix and restrict NVFP4 edge-cases in llama-graph (#24331)

* Move the post-GEMM MUL required for dequant before the LoRA and bias add

  See https://github.com/ggml-org/llama.cpp/pull/23484:

  1. For LoRA, I would presume we want fully dequantized values before doing the residuals, but this depends on how the LoRAs were generated. Literature tells me LoRA happens post-mul but pre-bias add: https://github.com/ggml-org/llama.cpp/pull/8332
  2. For ModelOPT, bias-add should happen on [fully-dequantized values](https://github.com/NVIDIA/Model-Optimizer/blob/b49f9b9e2d747af992d78a3aa7f10efe5a8847e1/modelopt/torch/quantization/backends/nvfp4_gemm.py#L59-L64)

* Restrict build_ffn for NVFP4 to supported combinations
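The ordering described in the commit message can be illustrated with a small, self-contained sketch. This is not the llama-graph code: the function name `linear_scaled_sketch`, its parameters, and the use of a single scalar scale are assumptions made purely for illustration. It only shows why the post-GEMM multiply has to come before the LoRA residual and the bias add if those are meant to operate on fully dequantized values.

```python
def matvec(w, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def linear_scaled_sketch(w_q, scale, x, lora_a=None, lora_b=None, bias=None):
    """Hypothetical illustration of the op order: GEMM -> MUL -> LoRA -> bias.

    Assumes one scalar scale stands in for the post-GEMM MUL.
    """
    # 1. GEMM against the still-scaled (not fully dequantized) weights.
    y = matvec(w_q, x)
    # 2. Post-GEMM MUL: after this step the values are fully dequantized.
    y = [v * scale for v in y]
    # 3. LoRA residual B(Ax), added to the dequantized output.
    if lora_a is not None and lora_b is not None:
        delta = matvec(lora_b, matvec(lora_a, x))
        y = [v + d for v, d in zip(y, delta)]
    # 4. Bias add, also on dequantized values.
    if bias is not None:
        y = [v + b for v, b in zip(y, bias)]
    return y


w_q = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
bias = [1.0, 1.0]

correct = linear_scaled_sketch(w_q, 0.5, x, bias=bias)
# Applying the scale after the bias add would also scale the bias.
wrong = [(v + b) * 0.5 for v, b in zip(matvec(w_q, x), bias)]
print(correct, wrong)
```

With these toy inputs the two orderings disagree, which is the kind of edge case the reordering is meant to avoid: a bias or LoRA delta added before the multiply would be scaled along with the GEMM output.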

References

#24331 - Fix and restrict NVFP4 edge-cases in llama-graph

Author

ORippler

Parents

a1824902
