ggml
sync : llama.cpp
#1459
Merged


ggerganov merged 27 commits into master from sync-llama.cpp-26-04-09
arthw sycl : fix llama_kv_cache hang when kv_cache is huge: 5GB (llama/21283)
df72261d
ArberSephirotheca ggml-webgpu: add vectorized flash attention (llama/20709)
e68c2916
rgerganov rpc : reuse compute graph buffers (llama/21299)
f24ad1e4
z-vishal ggml-zendnn : add MUL_MAT_ID op support for MoE models (llama/21315)
c483ff67
reeselevine ggml-webgpu: move from parameter buffer pool to single buffer with of…
44dc697a
YardenTal44 hexagon: slight optimization for argsort output init (llama/21463)
799db709
arthw sycl : handle other FA case (llama/21377)
88b4f515
gaugarg-nv Write an optimized flash_attn_stream_k_fixup kernel (llama/21159)
efd3b513
khosravipasha ggml: add Q1_0 1-bit quantization support (CPU) (llama/21273)
b8621039
yomaytk ggml-webgpu: Add the support of `MUL_MAT_ID` (llama/21147)
2f4c4462
PMZFX Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (llama/21…
983da6a6
aviallon ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) …
3f81977b
mkoker vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (llama/21029)
dc2f14ad
tomoverlund ggml: Vulkan build, Linux -- output error string for errno on fork fa…
7ee5f7c9
ggerganov ggml : deprecate GGML_OP_ADD1 (llama/21363)
6bf0eec7
am17an CUDA: check for buffer overlap before fusing (llama/21566)
f2a19d4a
reeselevine ggml-webgpu: parameterize submission size and add iOS specific limits…
fa761cd3
iacopPBK ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (llama/21168)
ac52cf58
am17an CUDA: make cuda graphs props check faster (llama/21472)
c3bf23a2
khosravipasha metal: Q1_0 backend (llama/21528)
4969090e
reeselevine webgpu : Query for adapter support when registering WebGPU backend (l…
07dc4a68
RealOrko fix: free ctx_copy in ggml_opt_free to plug per-training-session leak…
c9e99715
am17an CUDA: also store `node->src->data` ptrs for equality check (llama/21635)
607fbdfc
0cc4m vulkan: unify type macros to use Vx instead of _VECx (llama/21605)
0fe04584
qnixsynapse sycl : add flash-attn support for head size 512 (llama/21654)
ccb3f958
ggerganov metal : add missing mm-id specializations for q1_0 (llama/21662)
677816c5
ggerganov sync : llama.cpp
5c1d8e02
danbev approved these changes on 2026-04-09
ggerganov merged 58c38058 into master 81 days ago
ggerganov deleted the sync-llama.cpp-26-04-09 branch 81 days ago
