transformers
GgufLinear: inference-time GGUF matmul on Apple Silicon — llama.cpp parity
#45977

Open

ArthurZucker wants to merge 31 commits into main from gguf-matmul-kernels

ArthurZucker force pushed from 56d3847a to cb6ba169 37 days ago

Add GgufLinear: inference-time GGUF matmul on Apple Silicon

69d0f977

ArthurZucker force pushed from 5134799c to 69d0f977 33 days ago

ArthurZucker changed the base branch from update-gguf to main 33 days ago

doc

d75a23bc

GGUF cleanup — align with the FP8 quantizer pattern

5635106a

GGUF: target-aware GGUFDequantize drops the dense-Linear byte-copy

a23cae8c

GGUF: route MoE experts through the WeightConverter API too

bbb34dba

GGUF: register Mixtral / DeepSeek-V3 in MODEL_TYPE_TO_GGUF_EXPERTS

d4f6d40e

GGUF: GgufExperts matches MixtralExperts layout — merge converter jus…

1a820f9f

GGUF cleanup pass: drop modeling_utils side-path + fix MoE config att…

daa4d78c

gguf_kernels: drop snapshot_download fallback, use kernels.get_kernel…

6c08eda1

MODEL_TYPE_TO_GGUF_EXPERTS: sync with the MoE entries in _GGUF_ARCH_C…

c09fd9ce

MODEL_TYPE_TO_GGUF_EXPERTS: cover all MoE archs that quantize to GGUF

992843d2

GGUF: own its experts interface, drop entries from base ExpertsInterface

307aaab9

GGUF cleanup: drop bespoke helpers, mirror FP8 conventions tighter

c389d753

GGUF: explicit kernel refs on each module, drop bind helpers

ee2eef2c

GGUF: safetensors save round-trip via module_quant_types

a6c52296

9efcab00

cleanup

42c7e444

2e067f1b

updates

f56b072c

type

64067535

GGUF: per-arch rope/norm fixes + writable mmap + i-quant fallback

c43f20c6

Merge remote-tracking branch 'origin/main' into gguf-matmul-kernels

bdcfef8b

GGUF: lazy-import torch in quantizer_gguf to keep PIL-only CI happy

5ea30d9b

Merge branch 'main' into gguf-matmul-kernels

18cd7219

Merge branch 'main' of github.com:huggingface/transformers into gguf-…

57ba8842

Merge branch 'main' of github.com:huggingface/transformers into gguf-…

c76d77b9

GGUF: simplify quantizer plumbing and fix swap-plan rename bug

ccbcb1cf

GGUF: fix MPS dequant byte-corruption and make norm de-offset data-dr…

8d87b3c4

GGUF: de-offset norms via keep-in-fp32 instead of pre-applying

b11c4d7a

GGUF: expose header metadata without materializing tensors

e09ea879

GGUF: uniform meta-time swap, no rename, bind kernels post-load

e0aae61b
