Pull Requests ggerganov/llama.cpp

CUDA: Support CUDA Virtual Devices (labels: ggml, CUDA)

#25228 opened 2026-07-02 09:06 by anavp-nvidia

server : don't list cached models when a preset is used (labels: server)

#25226 opened 2026-07-02 07:38 by angt

[SYCL] Flash Attention with XMX engine via oneDNN graph API (SDPA) on KV f16; Qwen3.6-27b-Q8_0 prefill speed up x1.21 at p=512 and x4.26 at p=80k (labels: ggml, SYCL)

#25222 opened 2026-07-02 06:26 by hmscider

common : add missing <fstream> include in common.h

#25220 opened 2026-07-02 05:39 by zhangrunda

sycl: add fused top-k MoE (labels: documentation, ggml, merge ready, SYCL)

#25217 opened 2026-07-02 03:55 by newjordan

hexagon: add VISION RoPE support (labels: ggml, Hexagon)

#25216 opened 2026-07-02 03:12 by aparmp-quic

server: add --no-sleep flag for GPU heartbeat on headless GPUs (labels: Vulkan, server, ggml, SYCL, CUDA)

#25214 opened 2026-07-01 21:02 by johnkarlhill

llama: optimize RWKV7 inference by fusing some graph operators (labels: model, testing, Vulkan, ggml, SYCL, Apple Metal, CUDA)

#25206 opened 2026-07-01 16:41 by MollySophia

sycl: add GGML_SYCL_FATTN_VEC_NTHREADS build option (labels: ggml, SYCL)

#25205 opened 2026-07-01 15:51 by Titaniumtown

llama: fix quantized kv-cache for dsv4 model

#25202 opened 2026-07-01 12:44 by am17an

llama-cli: fix passing chat_template_kwargs and reasoning_format params (labels: examples)

#25201 opened 2026-07-01 12:41 by percontation

ggml-cpu: Enable tiled matmul on AIX (labels: ggml)

#25199 opened 2026-07-01 12:17 by shalinib-ibm

vulkan: disable async transfer queue on amdvlk (mitigate MoE partial-offload crash) (labels: Vulkan, ggml)

#25196 opened 2026-07-01 08:05 by liminfei-amd

vulkan: Remove crash guard for Intel GPU (labels: Vulkan, ggml)

#25192 opened 2026-07-01 06:49 by rillomas

openvino: fix SWA mask detection for long prompts (labels: ggml, OpenVINO)

#25189 opened 2026-07-01 02:56 by zlma7001

CUDA/HIP: add Q2_0 (PrismML ternary 1.58-bit) support (labels: ggml, CUDA)

#25188 opened 2026-07-01 02:32 by The-Monk

spec: add backend sampling for DFlash

#25180 opened 2026-06-30 18:41 by ruixiang63

tests: Source-level separation between llama.cpp and ggml testing

#25179 opened 2026-06-30 18:33 by ckastner

metal: add col2im_1d op (f32/f16/bf16) (labels: ggml, Apple Metal)

#25176 opened 2026-06-30 17:44 by ServeurpersoCom

spec: add DSpark speculative decoding (labels: documentation, model, testing, conversion)

#25173 opened 2026-06-30 13:55 by wjinxu

grammar : recognize '|' at start of continuation line (labels: testing)

#25170 opened 2026-06-30 11:23 by o7si

hexagon: allow dflash lm-head offload experiment (labels: model, examples, ggml, Hexagon)

#25166 opened 2026-06-30 09:43 by Salanfeng

Add support for Laguna XS.2 & M.1 (labels: model, testing, ggml, CUDA, conversion)

#25165 opened 2026-06-30 09:31 by joerowell

ggml : fix wrong transpose function for int16 data (labels: ggml)

#25161 opened 2026-06-30 07:13 by I3eg1nner

cuda: fix crash when querying memory on device with no free memory. (labels: ggml, CUDA)

#25157 opened 2026-06-30 03:43 by cphlipot

ggml: imatrix-aware NVFP4 quantization (scale search) + wire NVFP4 ftype (labels: examples, ggml)

#25153 opened 2026-06-30 00:59 by avifenesh

common, server : preserve HF file for cached models (labels: server)

#25152 opened 2026-06-29 23:44 by mrexodia

CUDA: add COL2IM_1D op (labels: documentation, ggml, CUDA)

#25151 opened 2026-06-29 22:53 by Ssamdeman

speculative: fix MTP draft crash on vision inputs

#25144 opened 2026-06-29 19:15 by ServeurpersoCom

llama : add position-relocatable KV range save/load (labels: testing)

#25133 opened 2026-06-29 14:36 by Anyesh