transformers
ccbcb1cf - GGUF: simplify quantizer plumbing and fix swap-plan rename bug

Commit

3 days ago

GGUF: simplify quantizer plumbing and fix swap-plan rename bug

- Drop GGUF save support (gguf_writer.py, save_gguf); ship in a separate PR.
- Remove metal_kernels_available probe; GgufLinear/GgufExperts now let ensure_metal_kernels() raise instead of swallowing into a dead module.
- Move gguf_file= config construction into GgufQuantizeConfig.from_gguf_file and the disk-offload guard into GGUFQuantizer.validate_environment, so from_pretrained carries no GGUF special-case.
- use_kernels defaults to False directly (no None dance).
- Rename linear_mode -> keep_quantized, centralized in _resolve_keep_quantized.
- Build the module-swap plan from GGUF header metadata (tensor_quant_types) instead of the materialized tensors.
- Fix rename_source_key(prefix=...) -> base_model_prefix=...; the wrong kwarg raised TypeError for every tensor, silently caught, so the gguf_file swap plan was always empty (verified end-to-end: 169 GgufLinear modules now swap and generate correctly on MPS). Remove the masking try/except.
- Single-regex skip in replace_with_gguf_linear; use model.set_submodule.
- Add CPU coverage for the metadata-driven swap plan.

SubtractOne stays a no-op pass-through with the de-offset pre-applied in load_checkpoint_state: slow weights-conversion tests confirm the loader casts to the target dtype before the converter chain, so the fp32 subtraction must happen up-front.
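The rename bug described above is easy to reproduce in isolation. The following is a minimal sketch, not the transformers code: the `rename_source_key` body, the `build_plan_*` helpers, the tensor names, and the quant-type strings are all hypothetical stand-ins chosen for illustration. It shows how a wrong keyword argument inside a broad try/except yields an empty swap plan without any visible error, and how dropping the try/except lets the metadata-driven plan fill in.

```python
# Toy stand-ins only: the names mirror the commit message, but none of this
# is the actual transformers implementation.

def rename_source_key(key: str, base_model_prefix: str = "") -> str:
    """Hypothetical renamer: strip '.weight' and prepend the model prefix."""
    name = key.removesuffix(".weight")
    return f"{base_model_prefix}.{name}" if base_model_prefix else name


# Hypothetical header metadata: tensor name -> quantization type.
tensor_quant_types = {
    "layers.0.attn.q_proj.weight": "Q4_K",
    "layers.0.attn.k_proj.weight": "Q4_K",
    "layers.0.norm.weight": "F32",  # unquantized, should not be swapped
}


def build_plan_buggy(quant_types):
    """Passes the wrong kwarg and swallows the resulting TypeError."""
    plan = {}
    for key, qtype in quant_types.items():
        if qtype == "F32":
            continue
        try:
            # 'prefix' is not a parameter of rename_source_key -> TypeError.
            plan[rename_source_key(key, prefix="model")] = qtype
        except Exception:
            continue  # the error is masked, so nothing is ever added
    return plan


def build_plan_fixed(quant_types):
    """Uses the correct kwarg and lets any error propagate."""
    plan = {}
    for key, qtype in quant_types.items():
        if qtype == "F32":
            continue
        plan[rename_source_key(key, base_model_prefix="model")] = qtype
    return plan


buggy_plan = build_plan_buggy(tensor_quant_types)
fixed_plan = build_plan_fixed(tensor_quant_types)
print("buggy:", buggy_plan)
print("fixed:", fixed_plan)
```

In this sketch the buggy builder returns an empty dict while the fixed one maps both quantized projections to their quant type, which matches the symptom the commit reports: a swap plan that was always empty until the kwarg was corrected and the try/except removed.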

References

#45977 - GgufLinear: inference-time GGUF matmul on Apple Silicon — llama.cpp parity

Author

ArthurZucker

Committer

ArthurZucker

Parents

c76d77b9
