sync : ggml #3636

ggerganov merged 70 commits into master from sync-ggml-26-01-30
zhang-hui-yulo HIP: add fattn-mma-f16 for RDNA4 (llama/18481)
e94d11aa
DaAwesomeP ggml-metal: do not copy headers for embedded, use current binary dir …
241d378e
0cc4m vulkan: work around Intel fp16 bug in mmq (llama/18814)
d1549657
danbev CUDA : fix typo in clang pragma comment [no ci] (llama/18830)
848a34ae
jeffbolznv vulkan: Check maxStorageBufferRange in supports_op (llama/18709)
a4c04139
ORippler CUDA: Factor out and re-use `block_reduce` function (llama/18785)
ed8e1b9c
max-krasnyansky hexagon: support for OP_CPY, host buffers now optional (llama/18822)
a49e6887
shalinib-ibm ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (llama/18837)
b264175e
JohannesGaessler CUDA: fix alignment on register spill for FA (llama/18815)
0ed573e2
ggerganov cuda : print less debug logs when disabling cuda graphs (llama/18868)
da3b1b73
shaofeiqi OpenCL: add SOLVE_TRI op support (llama/18846)
3aa5e7ec
hipudding CANN: support gated linear attn (llama/18653)
3355e081
noemotiovon CANN: fix an issue where get_env was not fully renamed (llama/18796)
b59fcb02
rauletorresc CANN: Remove unused `ggml_cann_get_device` function (llama/18625)
60497ff4
DaAwesomeP ggml-blas: hide warnings from included BLAS headers (llama/18818)
a5fab81a
ThoreKoritzius ggml : extend ggml_pool_1d + metal (llama/16429)
cdf61be2
reeselevine ggml webgpu: support for backend sampling (llama/18880)
391bdb90
lhez opencl: fix q6_K mv for m=1 (llama/18893)
758c3c44
ggerganov ggml : add ggml_build_forward_select (llama/18550)
867ca7da
ggerganov metal : enable FA for MLA heads (llama/18950)
a987d92e
angt ggml : cleanup path_str() (llama/18928)
6d6f9866
ORippler CUDA: Replace init_offsets kernel with iterators in cub-based argsort…
cb6577c4
ORippler CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator…
afe234b9
jeffbolznv vulkan: Use mul_mat_vec_id for small values of n (llama/18918)
85e51c0b
rillomas Revert "vulkan: force full subgroups for flash attention to fix intel…
cf88fb0a
jeffbolznv vulkan: support flash attention GQA/split_k with small batches (llama…
f1c37d25
jeffbolznv vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945)
28016940
AlekseiNikiforovIBM ggml-zdnn : mark zDNN buffers as non-host (llama/18967)
6863e60a
shaofeiqi opencl: add TRI op support (llama/18979)
3e1b8360
am17an CUDA: add gqa_ratio 4 for GLM 4.7 flash (llama/18953)
c511b403
lhez opencl: enable the general fp mm for non-cont input and as a fallback…
fffe9d00
JohannesGaessler CUDA: fix alignment check for FA (llama/19023)
e2d86f7b
ggerganov mla : make the V tensor a view of K (llama/18986)
ff0f90f9
Alcpz ggml-cpu: aarch64: q5_K repack gemm and gemv (and generic) implementat…
1708c1e1
arthw use malloc to support both iGPU and dGPU at the same time (llama/18992)
29ba416e
chraac ggml-hexagon: flash-attn opt (llama/19025)
af0790bf
am17an ggml-cuda: enable cuda-graphs for `n-cpu-moe` (llama/18934)
4abd7e2b
JohannesGaessler CUDA: re-use MLA K data for V in MMA FA (llama/19057)
b52f20c2
ggerganov kv-cache : support V-less cache (llama/19067)
83a0a997
am17an ggml-cpu: Use tiled FA for prompt-processing (llama/19012)
72d9d3e9
ccbinn metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/m…
024f396d
JohannesGaessler CUDA: faster FA for GQA > 1 but not power of 2 (llama/19092)
cde57edc
JohannesGaessler CUDA: fix padding of GQA to power of 2 in FA (llama/19115)
247e5234
lhez opencl: add flattened q6_K mv (llama/19054)
8e3b59de
shalinib-ibm ggml-cpu: Enable FP16 MMA kernels on PPC (llama/19060)
861dc273
gaugarg-nv Reduce CPU-side stalls due to the CUDA command buffer being full (lla…
a415aa4d
Alcpz ggml-cpu: aarch64: q6_K repack gemm and gemv (and generic) implementat…
91777092
JohannesGaessler CUDA: tune GLM 4.7 Flash FA kernel selection logic (llama/19097)
5cc23d5b
z-vishal ggml-zendnn : update ZenDNN git tag to main branch (llama/19133)
25fbee9c
nikhilJain17 ggml webgpu: Split shared state (webgpu_context) into global state an…
efc07fbe
ggerganov CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (llama…
4a227b50
ggerganov cuda : fix "V is K view" check for non-unified KV cache (llama/19145)
e80c5899
Alcpz ggml-cpu: arm64: Q4_K scale unroll and vectorization (llama/19108)
6551c7a0
kpouget ggml: new backend for Virglrenderer API Remoting acceleration (v2) (l…
ca91a430
okuvshynov vulkan: handle device dedup on MacOS + Vega II Duo cards (llama/19058)
389a7e22
PatKamin ggml-sycl: remove unused syclcompat header (llama/19140)
2e291680
0cc4m Vulkan Flash Attention Coopmat1 Refactor (llama/19075)
a3958464
arthw sycl: fix norm kernels: l2_norm, group_norm, rms_norm by removing asser…
0a7ea4b3
am17an CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.…
71627ccd
z-vishal ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (…
7bed9d18
zhang-hui-yulo HIP: add mmf for CDNA (llama/18896)
e5891014
ggerganov cuda : fix nkvo, offload and cuda graph node properties matching (lla…
38bb4d64
tboinovski1 hexagon: enable offloading to Hexagon on Windows on Snapdragon (llama…
99076001
ArberSephirotheca ggml-webgpu: improve flashAttention performance by software pipelinin…
6cf5828d
RachelMantel sycl: implement GGML_OP_TRI (llama/19089)
2466cf11
s8322 sycl: implement GGML_UNARY_OP_SOFTPLUS (llama/19114)
c5003e9b
bssrdf add tensor type checking as part of cuda graph properties (llama/19186)
33b2ffca
ggerganov sync : ggml
859f583d
ggerganov talk-llama : sync llama.cpp
54d3fbc1
danbev approved these changes on 2026-01-30
ggerganov cuda : fix compile warnings (#0)
ced24648
ggerganov merged acbace05 into master 3 days ago
ggerganov deleted the sync-ggml-26-01-30 branch 3 days ago
