sync : ggml #3710

ggerganov merged 91 commits into master from sync-ggml-26-03-16
am17an ggml-cpu: add repack for mxfp4 (llama/19738)
8f3aa2fd
Jayluci4 CUDA: add CDNA3 MFMA support for flash attention MMA kernel (llama/19…
50646889
oobabooga cuda: cap grid.y at 65535 in non-contiguous dequantize/convert kernel…
681debff
0cc4m vulkan: improve partial offloading performance on AMD (llama/19976)
91e9af83
taronaeo ggml-cpu: optimise s390x multiply extend instructions (llama/20032)
1810a80b
0cc4m vulkan: tune MMVQ for Intel Windows (llama/19988)
47b12eae
yomaytk ggml-webgpu: Support non-contiguous `src0` and overlapping `src0/src1…
84e45a60
nikhilJain17 ggml webgpu: Clean up per-thread parameter buffer pool and job submis…
8be81d37
abhijitramesh ggml webgpu: fix workgroup dispatch limit for large batch sizes (llam…
a444c8a0
shaofeiqi opencl: add optimized q4_1 mm kernel for adreno (llama/19840)
d89fc23b
chaxu01 kleidiai : add sme fp16 compute path for q4_0 gemm on aarch64 (llama/…
2eeb5e3a
angt ggml : use a simple std::thread in AMX without OpenMP (llama/20074)
1a9b0f9f
JohannesGaessler ggml: fix ggml_is_contiguous_n for ne == 1 (llama/20092)
ffe593bb
yomaytk Add concat op to webgpu. (llama/20068)
c456e26e
nikhilJain17 Fix wait logic for inflight jobs (llama/20096)
6ae853c3
lhez opencl: add `SET`, support i32 for `CPY`, minor refactor for cpy (lla…
38cc52c1
max-krasnyansky hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and M…
0964d663
marcelpetrick chore : correct typos [no ci] (llama/20041)
484163ec
aendk CUDA: Improve performance via less synchronizations between token (ll…
22502c04
YardenTal44 hexagon: add fp16 support for binary ops: add,sub,mul,div (llama/20139)
15ff3f5b
lhez opencl: add neg, exp and diag (llama/20127)
dc6b7229
JohannesGaessler ggml-cpu: fix data race for debug asserts (llama/20148)
a27e51cb
am17an CUDA: use shared mem for ssm_conv (llama/20128)
cbd9e948
shalinib-ibm ggml-cpu: Fix gcc 15 ICE on ppc64le (ggml/20083) (llama/20130)
96fb6151
taronaeo ggml: update comments for backends which have no memory to report (ll…
7f38f2a2
am17an ggml-cuda: add mem check for fusion (llama/19916)
6e34b302
max-krasnyansky cpu: skip redundant ROPE cache updates (llama/20149)
9bfd03b9
tboinovski1 hexagon: add f32 ssm_conv op (llama/20122)
d5ea0591
bartowski1182 quants : Add memsets and other fixes for IQ quants (llama/19861)
94de6807
lhez opencl: add l2_norm (llama/20160)
78531aa0
am17an ggml: add GATED_DELTA_NET op (llama/19504)
bc20d1ab
arthw support Flash Attention for fp32/fp16/Q4/Q5/Q8 (llama/20190)
b61009bb
jeffbolznv vulkan: Fix data races in coopmat1 mul_mat(_id) (llama/20084)
df65a360
GiantPrince ggml-vulkan: Add ELU op support (llama/20183)
18a8e3d5
tehsiuhuang cuda : display total and free VRAM capacity during device initializat…
47ba98e1
0cc4m vulkan: skip zero size tensors in backend copies (llama/20233)
e8501ded
bertaye ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (l…
f328ac96
am17an ggml-cuda: disable gdn for musa (llama/20278)
e307d934
ggerganov metal : add upscale (llama/20284)
dac5a06a
arkavo-com metal : extend mul_mv_ext to BF16, Q2_K, Q3_K (llama/20250)
6a1de6f8
JulianPscheid metal: handle command buffer failures gracefully in synchronize (llam…
d2a96485
taimur-10x ggml-cpu: add RVV repack GEMM and GEMV for quantization types (llama/…
ef15437e
chaxu01 kleidiai : support for concurrent sme and neon kernel execution (llam…
e155c541
reeselevine ggml webgpu: faster normal quant and some k-quant matrix operations, …
bf2d45a6
ggerganov ggml : bump RPC version (llama/20330)
f1265f13
arthw fix for failed UT case: ACC, L2_NORM, UPSCALE, fused_glu, unary (llam…
ec414e9d
arthw fix op rope, add rope_back (llama/20293)
e3db9561
IMbackK cuda/hip: fix loop unrolling in ssm-conv (llama/20369)
ff5fa922
IMbackK ggml-cuda: gdn use shared mem for HIP (llama/20366)
42df97a1
ggerganov metal : add env var to trigger graph capture (llama/20398)
80655c9a
ggerganov metal : fix q5_k mul_mv register spill (llama/20399)
dafed294
ggerganov metal : fix capture_compute counter logic (llama/20410)
ed7718a3
danbev llama : add support for Nemotron 3 Super (llama/20411)
e19ec84c
richarddd ggml : add NVFP4 quantization type support (llama/19769)
e56a3c19
ggerganov llama : enable chunked fused GDN path (llama/20340)
9f32749f
yomaytk ggml-webgpu: Add supports for `GGML_OP_REPEAT` (llama/20230)
0ec906b9
IMbackK hip: compile debug builds with -O2 on hip to avoid a compiler bug (ll…
20e71614
shaofeiqi opencl: add cumsum op (llama/18981)
c8fc662f
lhez opencl: use larger workgroup size for get_rows (llama/20316)
b14f9f58
rillomas vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large mode…
00917d94
jeffbolznv vulkan: fix OOB check in flash_attn_mask_opt (llama/20296)
812c45ee
jeffbolznv vulkan: fix l2_norm epsilon handling (llama/20350)
8ca1798d
ggerganov sync : ggml
570e1466
ggerganov metal : avoid divisions in bin kernel (llama/20426)
e18c6d28
ggerganov sync : ggml
606ee599
ProgenyAlpha vulkan: fix SSM_CONV PP scaling with large ubatch sizes (llama/20379)
e057ff00
ProgenyAlpha vulkan: add GATED_DELTA_NET op support (llama/20334)
f25cbd60
ggerganov llama : disable graph reuse with pipeline parallelism (llama/20463)
f5025aec
ggerganov metal : fix l2 norm scale (llama/20493)
8a33e87a
angt ggml : fix typo gmml (llama/20512)
7f44d6b7
rehan-10xengineer ggml-cpu: add RVV vec dot kernels for quantization types (llama/18859)
5022e7eb
ggerganov graph : remove redundant GDN state transposes (llama/20443)
97854bea
lhez opencl: fix l2_norm (llama/20480)
deaa2db4
Exile333 Fix data race in CUDA's "cpy" kernel (influences GGML's DUP, CONT ope…
bdc9bac0
wine99 ggml : add OpenVINO backend (llama/15307)
7a14c004
wallentri88 Use fp32 in cuBLAS V100 to avoid overflows, env variables to override…
cfa02a01
angt ggml : add native AVX512-FP16 support for F16 operations (llama/20529)
5389551b
arthw add op gated_delta_net (llama/20455)
b3106ec5
max-krasnyansky hexagon: Q4_0 and MXFP4 repack fixes (llama/20527)
21773e33
ggerganov metal : add FA specialization for HSK = 320, HSV = 256 (llama/20549)
d752d737
0cc4m vulkan: use graphics queue on AMD (llama/20551)
c43c45c5
JoursBleu cuda : add RDNA4-specific MMVQ parameter table for bs=1 decode (llama…
f9cd8348
bartowski1182 ggml : guard against sumq2 being 0 in IQ4_NL (llama/20460)
25a75674
moonshadow-25 ggml/hip: fix APU compatibility - soft error handling for hipMemAdvis…
603bc50d
ServeurpersoCom ggml: avoid creating CUDA context during device init (llama/20595)
d79032c9
JohannesGaessler CUDA: limit number of FA stream-k CUDA blocks (llama/20586)
491d5129
ggerganov common : add nvfp4 (ggml/0)
4ebc6f58
David366AI ggml : extend im2col f16 (ggml/1434)
83fed291
ggerganov sync : ggml
2f41a39b
ggerganov talk-llama : sync llama.cpp
678c2b42
danbev approved these changes on 2026-03-16
ggerganov ggml : try fix arm build (#0)
ae853f46
ggerganov merged 27fa2077 into master 73 days ago
ggerganov deleted the sync-ggml-26-03-16 branch 73 days ago
