GGUF: simplify quantizer plumbing and fix swap-plan rename bug

- Drop GGUF save support (gguf_writer.py, save_gguf); it will ship in a
separate PR.
- Remove the metal_kernels_available probe; GgufLinear/GgufExperts now let
ensure_metal_kernels() raise instead of swallowing the error and leaving a
dead module behind.
- Move gguf_file= config construction into GgufQuantizeConfig.from_gguf_file
and the disk-offload guard into GGUFQuantizer.validate_environment, so
from_pretrained carries no GGUF special-case.
- use_kernels defaults to False directly (no None dance).
- Rename linear_mode -> keep_quantized, centralized in _resolve_keep_quantized.
- Build the module-swap plan from GGUF header metadata (tensor_quant_types)
instead of the materialized tensors.
- Fix rename_source_key(prefix=...) -> base_model_prefix=...: the wrong kwarg
raised a TypeError for every tensor, which was silently caught, so the
gguf_file swap plan was always empty. Remove the masking try/except.
Verified end-to-end: 169 GgufLinear modules now swap and generate correctly
on MPS.
- Single-regex skip in replace_with_gguf_linear; use model.set_submodule.
- Add CPU coverage for the metadata-driven swap plan.
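
For reviewers, the failure mode behind the rename fix can be reproduced in isolation. This is a minimal sketch, not the actual loader code: `rename_key`, `build_plan_masked`, and `build_plan` are hypothetical stand-ins, and the tensor names and quant-type strings are made up for illustration.

```python
def rename_key(key: str, *, base_model_prefix: str = "") -> str:
    """Hypothetical stand-in for a key-renaming helper (not the real API)."""
    return f"{base_model_prefix}.{key}" if base_model_prefix else key


def build_plan_masked(tensor_quant_types: dict) -> dict:
    """Old shape of the bug: wrong kwarg plus a blanket try/except."""
    plan = {}
    for name, qtype in tensor_quant_types.items():
        try:
            # `prefix` is not a parameter of rename_key, so this raises
            # TypeError on every iteration.
            plan[rename_key(name, prefix="model")] = qtype
        except Exception:
            # The error is swallowed, so the tensor is skipped silently.
            continue
    return plan


def build_plan(tensor_quant_types: dict) -> dict:
    """Fixed shape: correct kwarg, no masking, errors propagate."""
    return {
        rename_key(name, base_model_prefix="model"): qtype
        for name, qtype in tensor_quant_types.items()
    }


# Illustrative header metadata (names and types are assumptions).
header = {"layers.0.attn.q_proj.weight": "Q4_K", "layers.0.mlp.up.weight": "Q6_K"}

masked_plan = build_plan_masked(header)
fixed_plan = build_plan(header)
```

With the masking handler in place the plan comes back empty without any error; once the handler is removed, the same typo would surface immediately as a TypeError.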

SubtractOne stays a no-op pass-through, with the de-offset pre-applied in
load_checkpoint_state. Slow weights-conversion tests confirm that the loader
casts to the target dtype before the converter chain runs, so the fp32
subtraction must happen up front.
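
The ordering constraint can be illustrated with a small numeric sketch. It assumes the de-offset is a plain "subtract 1.0" (as the SubtractOne name suggests) and uses fp16 as a stand-in for the target dtype; the weight value is made up, and the rounding is emulated with the standard-library struct module rather than the real loader.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_weight = 0.001                  # illustrative value, not from a checkpoint
stored = to_fp32(1.0 + true_weight)  # the file holds weight + 1 in fp32

# Subtract in fp32 first, then cast to the low-precision target dtype.
subtract_then_cast = to_fp16(to_fp32(stored - 1.0))

# Cast first, then subtract: the offset consumes most of the mantissa bits.
cast_then_subtract = to_fp16(to_fp16(stored) - 1.0)

err_early = abs(subtract_then_cast - true_weight)
err_late = abs(cast_then_subtract - true_weight)
print(f"subtract-then-cast error: {err_early:.3e}")
print(f"cast-then-subtract error: {err_late:.3e}")
```

Near 1.0 the fp16 grid spacing is 2**-10, so casting before the subtraction snaps the stored value to 1 + 2**-10 and the recovered weight is off by a few percent; subtracting in fp32 first keeps the error at the fp16 rounding level for the small value itself.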