sync : ggml #3428

ggerganov merged 224 commits into master from sync-ggml-25-09-20
ggerganov cuda : remove obsolete sources (ggml/1332)
80447f74
mgiessing ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (ll…
2ce5860a
yeahdongcn musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (l…
02b49af9
YangShuai52 CANN: optimize rope operator (llama/15335)
2572322b
lhez opencl: mark `argsort` unsupported if cols exceed workgroup limit (ll…
db1d2380
yeahdongcn musa: fix build warnings (llama/15258)
0eb2d653
jeffbolznv vulkan: shorten pipeline name strings (llama/15431)
5907ab3e
JohannesGaessler CUDA: replace GGML_CUDA_F16 with CUDA arch checks (llama/15433)
316ed78d
JohannesGaessler CUDA: refactor FA support/selection code (llama/15454)
8f0579a3
slaren sched : copy only the used experts when offloading prompt processing …
622dec5b
yeahdongcn musa: add GGML_UNUSED_VARS (llama/15446)
7c077845
ngxson ggml : fix condition of im2col on Metal backend (llama/15460)
c5874bcf
jeffbolznv vulkan: Reuse conversion results in prealloc_y (llama/15410)
04d0f9a0
ddwkim vulkan: add exp operation (llama/15456)
7eebd498
Acly vulkan : support conv_2d_dw with f16 weights (llama/15392)
9dd50399
slaren sched : fix possible use of wrong ids tensor when offloading moe prom…
554f96f3
noemotiovon CANN: Optimize RMS_NORM using cache (llama/15419)
be841c3f
taronaeo ggml-cpu: Support Q5_0 and Q5_1 on s390x (llama/15486)
380d3db2
YavorGIvanov cuda : add Pad Reflect 1D support (llama/14659)
18ca4e8f
rmatif ggml: add `conv3d` op (llama/15182)
d7b7498e
reeselevine ggml WebGPU: add support for quantization types (llama/15440)
bb5d7e2c
jeffbolznv vulkan: optimize mul_mat_id loading row ids into shared memory (llama…
485c5c3b
Acly vulkan : support ggml_mean (llama/15393)
5094171c
jeffbolznv vulkan: Rewrite synchronization to allow some overlap between nodes (…
d8eb9f7d
jeffbolznv vulkan: optimize rms_norm, and allow the work to spread across multip…
2f6288c3
JohannesGaessler CUDA: fix half2 -> half conversion for HIP (llama/15529)
b0d15e1e
jeffbolznv vulkan: workaround MoltenVK compile failure in multi_add (llama/15506)
27817867
0cc4m vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (llama/1…
8c7872d6
jeffbolznv vulkan: Support FA with any multiple of 8 head sizes (llama/15537)
85d4d2c8
0cc4m vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices…
ee11ed42
noemotiovon CANN: ROPE cache sin/cos repeat (llama/15501)
86331f74
ggerganov metal : add FA kernels for HS=40 (llama/15559)
54be54f4
JohannesGaessler CUDA: MoE helper in device code, better tile sizes (llama/15525)
1e856b29
booxter metal: fix regression when no metal devices are present (llama/15531)
8851ef54
0cc4m vulkan: fix min subgroup 16 condition for mmid subgroup optimization …
335d2a54
lhez opencl: fix support ops condition for `rms_norm` (llama/15560)
582ef379
Qeeweew CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (llama/15451)
2468074e
jeffbolznv vulkan: Remove splitting for mul_mat_id (llama/15568)
79e2bd5e
pt13762104 Add a warning for special devices (llama/15563)
9828caaf
CISC metal : remove contiguous assertion for src0 in IM2COL (llama/15577)
3bb52acb
ggerganov metal : improve `MUL_MAT_ID` (llama/15541)
dc693ca8
ggerganov metal : optimize FA vec for large sequences and BS <= 8 (llama/15566)
1c21a850
JohannesGaessler CUDA: return -1 for nonexistent compiled arch (llama/15587)
53010199
shalinib-ibm llamafile: PowerPC Sgemm Optimization (llama/15558)
31c7784e
qnixsynapse SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (…
94fa9f63
rmatif OpenCL: add fused group_norm/norm, mul, add (llama/15314)
a6ec224e
xctan ggml-cpu : add basic RVV support for vector f32 ops (llama/15057)
ece1bdfe
noemotiovon CANN: refactor mask handling and improve performance in FA (llama/15561)
02e8b231
IMbackK HIP: Enable support for ggml_backend_cuda_register_host_buffer (llama…
65fa2c0c
matiaslin cuda: Add cublasLt_static linking when GGML_STATIC is enabled (llama/…
88c0582b
ggerganov kv-cache : remove LLAMA_SET_ROWS checks (llama/15505)
cac62537
compilade ggml : fix SSM_SCAN for n_groups > 1 (llama/15625)
6dffbaa0
taronaeo ggml-cpu: fix invalid hsum build in debug s390x (llama/15634)
6287027a
mnehete32 CUDA: add conv2d (llama/15635)
dc9f55bb
am17an CUDA: fuse adds, fuse add with rms norm (llama/15631)
6d7ddaf7
am17an CUDA: fix bug in rms_norm fusion (llama/15660)
82ce91e7
noemotiovon CANN: FIx compiler warnings (llama/15661)
d629af15
jeffbolznv vulkan: Skip syncing for prealloc_y when it is reused (llama/15544)
a6dec4f4
JohannesGaessler CUDA: use FP32 arithmetic for conv2d (llama/15683)
b7809c40
JohannesGaessler llama: use FA + max. GPU layers by default (llama/15434)
f6ba3949
chaxu01 ggml: update kleidiai to v1.13.0 (llama/15663)
74583845
jeffbolznv vulkan: clamp matmul and FA results to the max finite value (llama/15…
71f0ee70
jeffbolznv vulkan: Allow fallback to sysmem memory when vidmem is full (llama/15…
20ce6fcf
danbev vulkan : remove unused portability_enumeration_ext variable (llama/15…
b092e95a
jeffbolznv vulkan: mul_mat_id coopmat2 optimizations (llama/15546)
191def71
jeffbolznv vulkan: handle large sizes for get_rows (llama/15686)
db7ecfb6
slaren llama : separate compute buffer reserve from fattn check (llama/15696)
b11c972b
ggerganov metal : fix checks for available FA kernels (llama/15700)
3d470687
hipudding CANN: fix RoPE cache issue on multi-device (llama/15629)
ed7ebdc7
hipudding CANN: Optimize MUL_MAT_ID (llama/15658)
bb5f844e
qnixsynapse CUDA: fix build error from ambiguous __half conversions in conv2d (ll…
2ba5e0cb
danbev ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops (llama/15695)
c5f511e6
0cc4m Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants …
5e70d901
Vithulep ggml: aarch64: Implement SVE F16 kernels for vector functions (llama/…
31840a3a
s-goto-11 ggml: SVE support for exponential functions (llama/15145)
8218dc60
0cc4m vulkan: disable large mmv subgroups on older Nvidia GPUs (llama/15717)
d5f80a29
jeffbolznv vulkan: add missing clamps in new mul_mat_id paths (llama/15702)
7a5e7368
giladgd vulkan: use memory budget extension to read memory usage (llama/15545)
9e3600e5
JohannesGaessler ggml-backend: raise GGML_MAX_SPLIT_INPUTS (llama/15722)
f20a7b0e
hipudding CANN: Support ext_factor in rope (llama/15710)
13d3963f
noemotiovon CANN: Support eager execution mode under ACL graph compilation (llama…
3db49c1c
rmatif opencl: add attn sinks support for FA kernels (llama/15706)
fb37f911
jeffbolznv vulkan: Fix macro parameter order for f32 matmul shaders (llama/15716)
1e03aa66
hipudding CANN: Resolve soft_max precision issue (llama/15730)
5aee53c4
0cc4m vulkan: fix shaders gen when no integer dot is available (llama/15740)
e584edb5
noemotiovon CANN: Fix type float_t to float (llama/15736)
d84b96d9
hipudding CANN: Mask unsupported TRANSPOSE_1D operator (llama/15733)
91e9e72e
xctan ggml-cpu : optimize RVV kernels (llama/15720)
75f739c7
hipudding CANN: Add RoPE contiguous check for 310I DUP device (llama/15735)
51bc843f
ORippler CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1…
9eef3773
relent95 ggml vulkan: add hardsigmoid and hardswish operations (llama/15762)
85c7aa37
danbev vulkan : update ggml_vk_instance_validation_ext_available (llama/15666)
4144ae10
jeffbolznv vulkan: don't use std::string in load_shaders, to improve compile tim…
4a702a86
0cc4m vulkan: fix mmv subgroup16 selection (llama/15775)
719a05c6
noemotiovon CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (llama/15760)
5c860e94
rmatif opencl: add hs=40 to FA (llama/15758)
1569daf5
hipudding CANN: Fix precision issue on 310I DUO multi-devices (llama/15784)
96efb472
leejet ggml: add ops for WAN video model (cuda && cpu) (llama/15669)
2228462b
noemotiovon CANN: Refactor ND to NZ workspace to be per-device (llama/15763)
3780a3c9
gabe-l-hart metal : Add template specialization for mul_mm_id w/ ne20 == 10 (llam…
ffe560cb
gjasny CUDA : conditionally add cuda architectures (ggml/1341)
c80f78cc
danbev ggml : introduce semantic versioning (ggml/1336)
4d6e1144
JohannesGaessler CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (llama/15802)
6ff468cf
taronaeo ggml-cpu: drop support for nnpa intrinsics (llama/15821)
f499271c
JohannesGaessler ggml-cpu: document use of "free" memory [no ci] (llama/15834)
69400f16
chaxu01 kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16…
be2676bb
JohannesGaessler CUDA: faster tile FA (Pascal/AMD), headsize 256 (llama/15769)
cd70d896
danbev ggml WebGPU: remove userdata from request adapter callback (llama/15527)
cda7d4e5
jeffbolznv vulkan: Use larger loads in scalar/coopmat1 matmul (llama/15729)
647e2d7d
jeffbolznv vulkan: Support pad_ext (llama/15794)
9523fd8d
taronaeo ggml-cpu: clean up s390x SIMD (llama/15855)
db4f504b
jeffbolznv vulkan: support im2col_3d (llama/15795)
dfa7722e
noemotiovon CANN: Stream sync between devices for acl_graph (llama/15809)
d9c0ead2
CISC CUDA: non-contiguous src0 not supported for PAD (llama/15869)
0175a1df
ngxson ggml: allow casting between f32 and i32 (llama/15783)
40bcd1a4
ggerganov metal : refactor + optimize (llama/15857)
e9cb59e9
ggerganov cuda : fix supports_op condition for get_rows when number of blocks i…
ae6cc6a3
am17an CUDA: generate_cu_files.py - add missing mxfp4 (llama/15880)
70ee808f
jeffbolznv vulkan: sort graph to allow more parallel execution (llama/15850)
c29cd548
JohannesGaessler CUDA: fix GET_ROWS for large tensors (llama/15882)
26098223
am17an CUDA: Add mul_mat_id support for the mmf kernel (llama/15767)
621764b1
lksj92hs Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs …
7fbbb67b
JohannesGaessler HIP: use v_dot2_f32_f16 instruction for FA (llama/15884)
e35d1375
jeffbolznv vulkan: Fix OOB accesses in soft_max_back (llama/15861)
d0e98656
0cc4m vulkan: throw the oom error instead of no memory type found (llama/15…
7abe1878
noemotiovon CANN: implement LRU cache for ACL graphs (llama/15814)
9b773aca
noemotiovon CANN: Add ROPE sin/cos cache for reuse (llama/15912)
4d453b14
ggerganov sync : ggml
e2c7f1cc
ggerganov metal : make the backend async (llama/15906)
7eae055e
ggerganov sync : ggml
c974f630
danbev ggml-cpu : fix padding in ggml_timestep_embedding (llama/15917)
3617008c
ORippler CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (l…
f5ef0e25
hipudding CANN: Disable acl_graph for prefill stage (llama/15933)
dadf7366
chaxu01 kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed (llama/15614)
b079d9c8
danbev ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (llama/15922)
020eb19e
JohannesGaessler CUDA: larger SRAM reads for tile FA, AMD FP16 dot (llama/15927)
f0768eb5
slaren ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (llama/1…
555dcb3e
NeoZhangJianyu Revert "sycl: add usage of enqueue_functions extension (llama/14244)"…
cd764eaf
mbaudier vulkan: Make device memory check more portable (llama/15939)
5a752bab
0cc4m Vulkan iGPU device selection overhaul and PCI ID API support (llama/1…
424c85f2
taronaeo ggml-zdnn: fix #15414, activate FP16 and BF16 acceleration and incorr…
e902731c
ggerganov metal : fix memory leaks (llama/15962)
20a930ec
ggerganov metal : allow ops to run concurrently (llama/15929)
0d36ba9e
ggerganov metal : refactor kernel loading (llama/15964)
2caf15d6
jeffbolznv vulkan: initialize vulkan-hpp to allow using extension function point…
a3defb0a
jeffbolznv vulkan: fix failing dequant shaders (llama/15862)
1789ed3f
taronaeo ggml-zdnn: rm user mapped buffers (llama/15965)
7dca05ca
ggerganov metal : fix kernel requirements (llama/15983)
2d3f1560
0cc4m Vulkan: Clean up mul_mm shader (llama/15987)
c36358cb
ggerganov metal : remove memory pools (llama/15966)
82a8c141
am17an CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (llama/15926)
10bd5d36
yael-works SYCL: Add COUNT_EQUAL operator support (llama/15991)
a642b533
jakekarnes42 CUDA: fix im2col_3d to respect non-contiguous inputs (views) (llama/1…
f72ec185
danbev ggml : fix padding in timestep embedding kernels (llama/15932)
5c524bb8
noemotiovon CANN: Optimize ggml_cann_set_device (llama/15935)
e32c3b0f
netrunnereve vulkan: automatically remove unsupported devices (llama/15976)
e96b2850
JohannesGaessler CUDA: fix FA occupancy, optimize tile kernel (llama/15982)
d452f0cf
ggerganov sync : ggml
6458bac4
ggerganov metal : refactor + optimize v2 (llama/15995)
eb2c01f9
reeselevine GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (llam…
1361f679
noemotiovon CANN: Remove print (llama/16044)
c46adc08
ggerganov metal : handle nil cv during pipeline creation (llama/16065)
1f24b1df
jhen0409 metal : avoid call free for non-owned buffer (llama/16067)
32b6d9c1
ggerganov metal : improve F32, F16 and BF16 mat-vec multiplication (llama/16057)
d37f590a
CISC cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (llama/16060)
225d7c1d
ggerganov metal : use function constants for mul_mv_ext kernels (llama/16074)
960aaa99
JohannesGaessler CUDA: fix compilation on CC 6.0 (llama/16091)
05bdfd43
bugparty CUDA: Optimize PAD_REFLECT_1D (llama/15957)
fce6354e
jeffbolznv rename optimize_graph to graph_optimize (llama/16082)
7fcb7e83
shawngu-quic opencl: optimize mxfp4 kernels (llama/16037)
f4a225ce
angt cmake : fix static linking for OpenMP on Unix-like systems (llama/16031)
4575f968
angt ggml-amx : fix ggml_amx_init() on generic Linux (llama/16049)
4d8cd078
ngxson ggml : refactor forward_dup for cpu backend (llama/16062)
2ad00d55
0cc4m vulkan: use vec dot for matrix matrix multiplications (llama/16056)
76d09342
ggerganov sync : ggml
66ad624d
ggerganov talk-llama : sync llama.cpp
36778bd8
ggerganov force pushed from d94f2148 to 36778bd8 163 days ago
danbev approved these changes on 2025-09-21
ggerganov ggml : bump version to 0.9.1
d89164a0
ggerganov ggml : prepare for development of 0.9.2-dev
8d10ded0
jeffbolznv vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATIST…
9a6c2036
giuseppe vulkan: optimize UMA buffer operations and fix driver hangs (llama/16…
eae2be0c
lhez opencl: initial `q8_0` mv support (llama/15732)
0a7096f4
lhez opencl: fix concat crash on win arm64 with Adreno (llama/15944)
4b7f09ac
0cc4m vulkan: vec dot matrix multiplication fix (llama/16151)
95b29fab
jeffbolznv vulkan: add RTE variants of exp shader (llama/16165)
14723f25
relent95 Vulkan: add conv_transpose_2d operation (llama/16022)
9f673df0
ggerganov ggml : add ggml_op_is_empty (llama/16122)
973054a8
ggerganov ggml : extend ggml_can_fuse to work with non-sequential nodes (llama/…
df672c63
CISC ggml : implement set_rows with i32 index (llama/16159)
4e32ee73
danbev ggml-cpu : fix typo in gemm comments [no ci] (llama/16189)
d8d31e36
taronaeo zdnn: refactor codebase + add docs (llama/16178)
c706a507
CISC ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (lla…
73e8f3ac
wishstudio ggml-cpu: Respect cpumask settings (llama/16164)
41245891
Acly ggml : split graph allocations according to backend max buffer size (…
5069c080
JohannesGaessler llama: print memory breakdown on exit (llama/15860)
cd431223
rgerganov rpc : use ggml logging facilities
09466196
ggerganov metal : restore im2col perf (llama/16219)
0a5b811f
ggerganov metal : relax reorder conditions (llama/16216)
268f1c96
ggerganov metal : fuse NORM + MUL + ADD, support non-multiples of 4 (llama/16220)
ac678efb
am17an CUDA: add a fused top-K MoE kernel (llama/16130)
d9bf63cf
taronaeo ggml-cpu: implement MXFP4 SIMD for s390x (llama/16193)
89a7b4d2
angt common : use cpp-httplib as a cURL alternative for downloads (llama/1…
9823c5cc
ggerganov metal : report OOM errors (llama/16274)
670d54ef
taronaeo devops: add s390x & ppc64le CI (llama/15925)
23b35989
jeffbolznv vulkan: support GET_ROWS for k-quants (llama/16235)
97bd65f9
DmyMi vulkan: throw system error instead of SIGABRT during init on older de…
88dd9e0d
JohannesGaessler CUDA: refactor and deduplicate vector FA kernels (llama/16208)
e856483c
am17an CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 (l…
85e4455c
Acly vulkan : make the vulkan.hpp dynamic dispatcher instance private (lla…
bc1ac13c
jeffbolznv vulkan: support arbitrary KV dimension in flash attention (llama/16160)
eb982dd7
jeffbolznv vulkan: handle mat_mul with A matrix > 4GB (llama/16176)
91ab93b7
ggerganov metal : fuse non-sequential nodes (llama/16102)
45976f28
ggerganov metal : extend mat-mat multiplication support (llama/16225)
0102733c
jeffbolznv vulkan: 64-bit im2col (llama/16135)
55d45edf
CISC ggml : fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 (llam…
5c6e7956
jeffbolznv vulkan: Fix validation failure in quantized flash attention (llama/16…
a375e4c4
ggerganov ggml : fix dependencies for ggml_set_rows (llama/16318)
7ce0a7bc
CISC ggml : check cuda and metal argsort limits and add test (llama/16323)
112e10f2
ggerganov cmake : remove metal flag (llama/0)
32013827
ggerganov sync : ggml
404a9311
ggerganov talk-llama : sync llama.cpp
fcf0181e
ggerganov force pushed from aa10f77b to fcf0181e 154 days ago
ggerganov merged fcf0181e into master 154 days ago
ggerganov deleted the sync-ggml-25-09-20 branch 154 days ago