sync : ggml #3428

ggerganov merged 224 commits into master from sync-ggml-25-09-20
ggerganov cuda : remove obsolete sources (ggml/1332)
80447f74
mgiessing ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (ll…
2ce5860a
yeahdongcn musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (l…
02b49af9
YangShuai52 CANN: optimize rope operator (llama/15335)
2572322b
lhez opencl: mark `argsort` unsupported if cols exceed workgroup limit (ll…
db1d2380
yeahdongcn musa: fix build warnings (llama/15258)
0eb2d653
jeffbolznv vulkan: shorten pipeline name strings (llama/15431)
5907ab3e
JohannesGaessler CUDA: replace GGML_CUDA_F16 with CUDA arch checks (llama/15433)
316ed78d
JohannesGaessler CUDA: refactor FA support/selection code (llama/15454)
8f0579a3
slaren sched : copy only the used experts when offloading prompt processing …
622dec5b
yeahdongcn musa: add GGML_UNUSED_VARS (llama/15446)
7c077845
ngxson ggml : fix condition of im2col on Metal backend (llama/15460)
c5874bcf
jeffbolznv vulkan: Reuse conversion results in prealloc_y (llama/15410)
04d0f9a0
ddwkim vulkan: add exp operation (llama/15456)
7eebd498
Acly vulkan : support conv_2d_dw with f16 weights (llama/15392)
9dd50399
slaren sched : fix possible use of wrong ids tensor when offloading moe prom…
554f96f3
noemotiovon CANN: Optimize RMS_NORM using cache (llama/15419)
be841c3f
taronaeo ggml-cpu: Support Q5_0 and Q5_1 on s390x (llama/15486)
380d3db2
YavorGIvanov cuda : add Pad Reflect 1D support (llama/14659)
18ca4e8f
rmatif ggml: add `conv3d` op (llama/15182)
d7b7498e
reeselevine ggml WebGPU: add support for quantization types (llama/15440)
bb5d7e2c
jeffbolznv vulkan: optimize mul_mat_id loading row ids into shared memory (llama…
485c5c3b
Acly vulkan : support ggml_mean (llama/15393)
5094171c
jeffbolznv vulkan: Rewrite synchronization to allow some overlap between nodes (…
d8eb9f7d
jeffbolznv vulkan: optimize rms_norm, and allow the work to spread across multip…
2f6288c3
JohannesGaessler CUDA: fix half2 -> half conversion for HIP (llama/15529)
b0d15e1e
jeffbolznv vulkan: workaround MoltenVK compile failure in multi_add (llama/15506)
27817867
0cc4m vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (llama/1…
8c7872d6
jeffbolznv vulkan: Support FA with any multiple of 8 head sizes (llama/15537)
85d4d2c8
0cc4m vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices…
ee11ed42
noemotiovon CANN: ROPE cache sin/cos repeat (llama/15501)
86331f74
ggerganov metal : add FA kernels for HS=40 (llama/15559)
54be54f4
JohannesGaessler CUDA: MoE helper in device code, better tile sizes (llama/15525)
1e856b29
booxter metal: fix regression when no metal devices are present (llama/15531)
8851ef54
0cc4m vulkan: fix min subgroup 16 condition for mmid subgroup optimization …
335d2a54
lhez opencl: fix support ops condition for `rms_norm` (llama/15560)
582ef379
Qeeweew CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (llama/15451)
2468074e
jeffbolznv vulkan: Remove splitting for mul_mat_id (llama/15568)
79e2bd5e
pt13762104 Add a warning for special devices (llama/15563)
9828caaf
CISC metal : remove contiguous assertion for src0 in IM2COL (llama/15577)
3bb52acb
ggerganov metal : improve `MUL_MAT_ID` (llama/15541)
dc693ca8
ggerganov metal : optimize FA vec for large sequences and BS <= 8 (llama/15566)
1c21a850
JohannesGaessler CUDA: return -1 for nonexistent compiled arch (llama/15587)
53010199
shalinib-ibm llamafile: PowerPC Sgemm Optimization (llama/15558)
31c7784e
qnixsynapse SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (…
94fa9f63
rmatif OpenCL: add fused group_norm/norm, mul, add (llama/15314)
a6ec224e
xctan ggml-cpu : add basic RVV support for vector f32 ops (llama/15057)
ece1bdfe
noemotiovon CANN: refactor mask handling and improve performance in FA (llama/15561)
02e8b231
IMbackK HIP: Enable support for ggml_backend_cuda_register_host_buffer (llama…
65fa2c0c
matiaslin cuda: Add cublasLt_static linking when GGML_STATIC is enabled (llama/…
88c0582b
ggerganov kv-cache : remove LLAMA_SET_ROWS checks (llama/15505)
cac62537
compilade ggml : fix SSM_SCAN for n_groups > 1 (llama/15625)
6dffbaa0
taronaeo ggml-cpu: fix invalid hsum build in debug s390x (llama/15634)
6287027a
mnehete32 CUDA: add conv2d (llama/15635)
dc9f55bb
am17an CUDA: fuse adds, fuse add with rms norm (llama/15631)
6d7ddaf7
am17an CUDA: fix bug in rms_norm fusion (llama/15660)
82ce91e7
noemotiovon CANN: FIx compiler warnings (llama/15661)
d629af15
jeffbolznv vulkan: Skip syncing for prealloc_y when it is reused (llama/15544)
a6dec4f4
JohannesGaessler CUDA: use FP32 arithmetic for conv2d (llama/15683)
b7809c40
JohannesGaessler llama: use FA + max. GPU layers by default (llama/15434)
f6ba3949
chaxu01 ggml: update kleidiai to v1.13.0 (llama/15663)
74583845
jeffbolznv vulkan: clamp matmul and FA results to the max finite value (llama/15…
71f0ee70
jeffbolznv vulkan: Allow fallback to sysmem memory when vidmem is full (llama/15…
20ce6fcf
danbev vulkan : remove unused portability_enumeration_ext variable (llama/15…
b092e95a
jeffbolznv vulkan: mul_mat_id coopmat2 optimizations (llama/15546)
191def71
jeffbolznv vulkan: handle large sizes for get_rows (llama/15686)
db7ecfb6
slaren llama : separate compute buffer reserve from fattn check (llama/15696)
b11c972b
ggerganov metal : fix checks for available FA kernels (llama/15700)
3d470687
hipudding CANN: fix RoPE cache issue on multi-device (llama/15629)
ed7ebdc7
hipudding CANN: Optimize MUL_MAT_ID (llama/15658)
bb5f844e
qnixsynapse CUDA: fix build error from ambiguous __half conversions in conv2d (ll…
2ba5e0cb
danbev ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops (llama/15695)
c5f511e6
0cc4m Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants …
5e70d901
Vithulep ggml: aarch64: Implement SVE F16 kernels for vector functions (llama/…
31840a3a
s-goto-11 ggml: SVE support for exponential functions (llama/15145)
8218dc60
0cc4m vulkan: disable large mmv subgroups on older Nvidia GPUs (llama/15717)
d5f80a29
jeffbolznv vulkan: add missing clamps in new mul_mat_id paths (llama/15702)
7a5e7368
giladgd vulkan: use memory budget extension to read memory usage (llama/15545)
9e3600e5
JohannesGaessler ggml-backend: raise GGML_MAX_SPLIT_INPUTS (llama/15722)
f20a7b0e
hipudding CANN: Support ext_factor in rope (llama/15710)
13d3963f
noemotiovon CANN: Support eager execution mode under ACL graph compilation (llama…
3db49c1c
rmatif opencl: add attn sinks support for FA kernels (llama/15706)
fb37f911
jeffbolznv vulkan: Fix macro parameter order for f32 matmul shaders (llama/15716)
1e03aa66
hipudding CANN: Resolve soft_max precision issue (llama/15730)
5aee53c4
0cc4m vulkan: fix shaders gen when no integer dot is available (llama/15740)
e584edb5
noemotiovon CANN: Fix type float_t to float (llama/15736)
d84b96d9
hipudding CANN: Mask unsupported TRANSPOSE_1D operator (llama/15733)
91e9e72e
xctan ggml-cpu : optimize RVV kernels (llama/15720)
75f739c7
hipudding CANN: Add RoPE contiguous check for 310I DUP device (llama/15735)
51bc843f
ORippler CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1…
9eef3773
relent95 ggml vulkan: add hardsigmoid and hardswish operations (llama/15762)
85c7aa37
danbev vulkan : update ggml_vk_instance_validation_ext_available (llama/15666)
4144ae10
jeffbolznv vulkan: don't use std::string in load_shaders, to improve compile tim…
4a702a86
0cc4m vulkan: fix mmv subgroup16 selection (llama/15775)
719a05c6
noemotiovon CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (llama/15760)
5c860e94
rmatif opencl: add hs=40 to FA (llama/15758)
1569daf5
hipudding CANN: Fix precision issue on 310I DUO multi-devices (llama/15784)
96efb472
leejet ggml: add ops for WAN video model (cuda && cpu) (llama/15669)
2228462b
noemotiovon CANN: Refactor ND to NZ workspace to be per-device (llama/15763)
3780a3c9
gabe-l-hart metal : Add template specialization for mul_mm_id w/ ne20 == 10 (llam…
ffe560cb
gjasny CUDA : conditionally add cuda architectures (ggml/1341)
c80f78cc
danbev ggml : introduce semantic versioning (ggml/1336)
4d6e1144
JohannesGaessler CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (llama/15802)
6ff468cf
taronaeo ggml-cpu: drop support for nnpa intrinsics (llama/15821)
f499271c
JohannesGaessler ggml-cpu: document use of "free" memory [no ci] (llama/15834)
69400f16
chaxu01 kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16…
be2676bb
JohannesGaessler CUDA: faster tile FA (Pascal/AMD), headsize 256 (llama/15769)
cd70d896
danbev ggml WebGPU: remove userdata from request adapter callback (llama/15527)
cda7d4e5
jeffbolznv vulkan: Use larger loads in scalar/coopmat1 matmul (llama/15729)
647e2d7d
jeffbolznv vulkan: Support pad_ext (llama/15794)
9523fd8d
taronaeo ggml-cpu: clean up s390x SIMD (llama/15855)
db4f504b
jeffbolznv vulkan: support im2col_3d (llama/15795)
dfa7722e
noemotiovon CANN: Stream sync between devices for acl_graph (llama/15809)
d9c0ead2
CISC CUDA: non-contiguous src0 not supported for PAD (llama/15869)
0175a1df
ngxson ggml: allow casting between f32 and i32 (llama/15783)
40bcd1a4
ggerganov metal : refactor + optimize (llama/15857)
e9cb59e9
ggerganov cuda : fix supports_op condition for get_rows when number of blocks i…
ae6cc6a3
am17an CUDA: generate_cu_files.py - add missing mxfp4 (llama/15880)
70ee808f
jeffbolznv vulkan: sort graph to allow more parallel execution (llama/15850)
c29cd548
JohannesGaessler CUDA: fix GET_ROWS for large tensors (llama/15882)
26098223
am17an CUDA: Add mul_mat_id support for the mmf kernel (llama/15767)
621764b1
lksj92hs Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs …
7fbbb67b
JohannesGaessler HIP: use v_dot2_f32_f16 instruction for FA (llama/15884)
e35d1375
jeffbolznv vulkan: Fix OOB accesses in soft_max_back (llama/15861)
d0e98656
0cc4m vulkan: throw the oom error instead of no memory type found (llama/15…
7abe1878
noemotiovon CANN: implement LRU cache for ACL graphs (llama/15814)
9b773aca
noemotiovon CANN: Add ROPE sin/cos cache for reuse (llama/15912)
4d453b14
ggerganov sync : ggml
e2c7f1cc
ggerganov metal : make the backend async (llama/15906)
7eae055e
ggerganov sync : ggml
c974f630
danbev ggml-cpu : fix padding in ggml_timestep_embedding (llama/15917)
3617008c
ORippler CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (l…
f5ef0e25
hipudding CANN: Disable acl_graph for prefill stage (llama/15933)
dadf7366
chaxu01 kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed (llama/15614)
b079d9c8
danbev ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (llama/15922)
020eb19e
JohannesGaessler CUDA: larger SRAM reads for tile FA, AMD FP16 dot (llama/15927)
f0768eb5
slaren ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (llama/1…
555dcb3e
NeoZhangJianyu Revert "sycl: add usage of enqueue_functions extension (llama/14244)"…
cd764eaf
mbaudier vulkan: Make device memory check more portable (llama/15939)
5a752bab
0cc4m Vulkan iGPU device selection overhaul and PCI ID API support (llama/1…
424c85f2
taronaeo ggml-zdnn: fix #15414, activate FP16 and BF16 acceleration and incorr…
e902731c
ggerganov metal : fix memory leaks (llama/15962)
20a930ec
ggerganov metal : allow ops to run concurrently (llama/15929)
0d36ba9e
ggerganov metal : refactor kernel loading (llama/15964)
2caf15d6
jeffbolznv vulkan: initialize vulkan-hpp to allow using extension function point…
a3defb0a
jeffbolznv vulkan: fix failing dequant shaders (llama/15862)
1789ed3f
taronaeo ggml-zdnn: rm user mapped buffers (llama/15965)
7dca05ca
ggerganov metal : fix kernel requirements (llama/15983)
2d3f1560
0cc4m Vulkan: Clean up mul_mm shader (llama/15987)
c36358cb
ggerganov metal : remove memory pools (llama/15966)
82a8c141
am17an CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (llama/15926)
10bd5d36
yael-works SYCL: Add COUNT_EQUAL operator support (llama/15991)
a642b533
jakekarnes42 CUDA: fix im2col_3d to respect non-contiguous inputs (views) (llama/1…
f72ec185
danbev ggml : fix padding in timestep embedding kernels (llama/15932)
5c524bb8
noemotiovon CANN: Optimize ggml_cann_set_device (llama/15935)
e32c3b0f
netrunnereve vulkan: automatically remove unsupported devices (llama/15976)
e96b2850
JohannesGaessler CUDA: fix FA occupancy, optimize tile kernel (llama/15982)
d452f0cf
ggerganov sync : ggml
6458bac4
ggerganov metal : refactor + optimize v2 (llama/15995)
eb2c01f9
reeselevine GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (llam…
1361f679
noemotiovon CANN: Remove print (llama/16044)
c46adc08
ggerganov metal : handle nil cv during pipeline creation (llama/16065)
1f24b1df
jhen0409 metal : avoid call free for non-owned buffer (llama/16067)
32b6d9c1
ggerganov metal : improve F32, F16 and BF16 mat-vec multiplication (llama/16057)
d37f590a
CISC cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (llama/16060)
225d7c1d
ggerganov metal : use function constants for mul_mv_ext kernels (llama/16074)
960aaa99
JohannesGaessler CUDA: fix compilation on CC 6.0 (llama/16091)
05bdfd43
bugparty CUDA: Optimize PAD_REFLECT_1D (llama/15957)
fce6354e
jeffbolznv rename optimize_graph to graph_optimize (llama/16082)
7fcb7e83
shawngu-quic opencl: optimize mxfp4 kernels (llama/16037)
f4a225ce
angt cmake : fix static linking for OpenMP on Unix-like systems (llama/16031)
4575f968
angt ggml-amx : fix ggml_amx_init() on generic Linux (llama/16049)
4d8cd078
ngxson ggml : refactor forward_dup for cpu backend (llama/16062)
2ad00d55
0cc4m vulkan: use vec dot for matrix matrix multiplications (llama/16056)
76d09342
ggerganov sync : ggml
66ad624d
ggerganov talk-llama : sync llama.cpp
36778bd8
ggerganov force pushed from d94f2148 to 36778bd8 163 days ago
danbev approved these changes on 2025-09-21
ggerganov ggml : bump version to 0.9.1
d89164a0
ggerganov ggml : prepare for development of 0.9.2-dev
8d10ded0
jeffbolznv vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATIST…
9a6c2036
giuseppe vulkan: optimize UMA buffer operations and fix driver hangs (llama/16…
eae2be0c
lhez opencl: initial `q8_0` mv support (llama/15732)
0a7096f4
lhez opencl: fix concat crash on win arm64 with Adreno (llama/15944)
4b7f09ac
0cc4m vulkan: vec dot matrix multiplication fix (llama/16151)
95b29fab
jeffbolznv vulkan: add RTE variants of exp shader (llama/16165)
14723f25
relent95 Vulkan: add conv_transpose_2d operation (llama/16022)
9f673df0
ggerganov ggml : add ggml_op_is_empty (llama/16122)
973054a8
ggerganov ggml : extend ggml_can_fuse to work with non-sequential nodes (llama/…
df672c63
CISC ggml : implement set_rows with i32 index (llama/16159)
4e32ee73
danbev ggml-cpu : fix typo in gemm comments [no ci] (llama/16189)
d8d31e36
taronaeo zdnn: refactor codebase + add docs (llama/16178)
c706a507
CISC ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (lla…
73e8f3ac
wishstudio ggml-cpu: Respect cpumask settings (llama/16164)
41245891
Acly ggml : split graph allocations according to backend max buffer size (…
5069c080
JohannesGaessler llama: print memory breakdown on exit (llama/15860)
cd431223
rgerganov rpc : use ggml logging facilities
09466196
ggerganov metal : restore im2col perf (llama/16219)
0a5b811f
ggerganov metal : relax reorder conditions (llama/16216)
268f1c96
ggerganov metal : fuse NORM + MUL + ADD, support non-multiples of 4 (llama/16220)
ac678efb
am17an CUDA: add a fused top-K MoE kernel (llama/16130)
d9bf63cf
taronaeo ggml-cpu: implement MXFP4 SIMD for s390x (llama/16193)
89a7b4d2
angt common : use cpp-httplib as a cURL alternative for downloads (llama/1…
9823c5cc
ggerganov metal : report OOM errors (llama/16274)
670d54ef
taronaeo devops: add s390x & ppc64le CI (llama/15925)
23b35989
jeffbolznv vulkan: support GET_ROWS for k-quants (llama/16235)
97bd65f9
DmyMi vulkan: throw system error instead of SIGABRT during init on older de…
88dd9e0d
JohannesGaessler CUDA: refactor and deduplicate vector FA kernels (llama/16208)
e856483c
am17an CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 (l…
85e4455c
Acly vulkan : make the vulkan.hpp dynamic dispatcher instance private (lla…
bc1ac13c
jeffbolznv vulkan: support arbitrary KV dimension in flash attention (llama/16160)
eb982dd7
jeffbolznv vulkan: handle mat_mul with A matrix > 4GB (llama/16176)
91ab93b7
ggerganov metal : fuse non-sequential nodes (llama/16102)
45976f28
ggerganov metal : extend mat-mat multiplication support (llama/16225)
0102733c
jeffbolznv vulkan: 64-bit im2col (llama/16135)
55d45edf
CISC ggml : fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 (llam…
5c6e7956
jeffbolznv vulkan: Fix validation failure in quantized flash attention (llama/16…
a375e4c4
ggerganov ggml : fix dependencies for ggml_set_rows (llama/16318)
7ce0a7bc
CISC ggml : check cuda and metal argsort limits and add test (llama/16323)
112e10f2
ggerganov cmake : remove metal flag (llama/0)
32013827
ggerganov sync : ggml
404a9311
ggerganov talk-llama : sync llama.cpp
fcf0181e
ggerganov force pushed from aa10f77b to fcf0181e 154 days ago
ggerganov merged fcf0181e into master 154 days ago
ggerganov deleted the sync-ggml-25-09-20 branch 154 days ago