ggerganov/llama.cpp
Open Pull Requests
mtmd : rename mtmd_get_audio_bitrate to mtmd_get_audio_sample_rate
  #20105 opened 2026-03-04 11:39 by danbev (labels: examples)
opencl: add `set`, i32 for `cpy`
  #20101 opened 2026-03-04 07:39 by lhez (labels: ggml, OpenCL)
[WebGPU] Fix wait logic for inflight jobs
  #20096 opened 2026-03-04 03:11 by nikhilJain17 (labels: devops, ggml)
hexagon: add llama-completion windows runner script
  #20095 opened 2026-03-04 01:15 by tboinovski1 (labels: script)
opencl: add q6_K gemm and gemv kernels for Adreno
  #20089 opened 2026-03-03 19:30 by lhez (labels: ggml, OpenCL)
server: Add OpenRouter-compatible reasoning API
  #20088 opened 2026-03-03 18:40 by roj234 (labels: examples, server)
Hybrid model cache: add `--checkpoint-every-nb`
  #20087 opened 2026-03-03 18:27 by pwilkin (labels: examples, server)
llama : add attention weights extraction API [EXPERIMENTAL]
  #20086 opened 2026-03-03 17:12 by QuentinFuxa (labels: examples, python)
vulkan: Fix data races in coopmat1 mul_mat(_id)
  #20084 opened 2026-03-03 16:50 by jeffbolznv (labels: Vulkan, ggml)
CUDA: Add BF16 path to CUBLAS and increase precision of FP16 path
  #20078 opened 2026-03-03 16:02 by ORippler (labels: Nvidia GPU, ggml)
fix: correct EXAONE3 FFN_DOWN tensor mapping prefix
  #20076 opened 2026-03-03 15:47 by Bias92 (labels: python)
fix: speculative decoding broken on hybrid SSM/MoE (Qwen3.5 MoE)
  #20075 opened 2026-03-03 14:57 by eauchs
vendor : update cpp-httplib to 0.36.0
  #20073 opened 2026-03-03 14:08 by cabelo (labels: script, python)
kleidiai : support for concurrent sme and neon kernel execution
  #20070 opened 2026-03-03 12:34 by chaxu01 (labels: documentation, ggml)
cli: add /think command to toggle reasoning
  #20069 opened 2026-03-03 12:03 by roj234 (labels: examples)
ggml-webgpu: Add the support of `GGML_OP_CONCAT`
  #20068 opened 2026-03-03 11:54 by yomaytk (labels: documentation, ggml)
cli: Don't clear system prompt when using '/clear'
  #20067 opened 2026-03-03 11:30 by roj234 (labels: examples)
webui: Improvements for Models Selector UI
  #20066 opened 2026-03-03 11:09 by allozaur (labels: examples, server)
cmake: fix ARM feature detection hang on platforms without SVE/SME
  #20064 opened 2026-03-03 10:36 by mbucko (labels: ggml)
llama: parallel model loading across GPU contexts
  #20062 opened 2026-03-03 09:47 by mxxm-t
ggml : add NVFP4 quantization type support for metal
  #20060 opened 2026-03-03 08:10 by richarddd (labels: testing, python, ggml, Apple Metal)
vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap
  #20059 opened 2026-03-03 07:56 by rillomas (labels: Vulkan, ggml)
server: fix infinite retry loop when KV cache is full
  #20050 opened 2026-03-02 23:02 by ssam18 (labels: examples, server)
fix(docs): correct typos found during code review
  #20041 opened 2026-03-02 14:46 by marcelpetrick (labels: documentation, model, script, testing, Nvidia GPU, Vulkan, examples, python, server, ggml, SYCL, Apple Metal, Ascend NPU, OpenCL, jinja parser)
contributing: limit open PRs for new contributors to 1
  #20036 opened 2026-03-02 07:37 by am17an
cann: support flash attention for head dim not multiple of 16
  #20031 opened 2026-03-02 02:40 by noemotiovon (labels: ggml, Ascend NPU)
vulkan: add UMA zero-copy async transfers and fix event_record deferred memcpy handling
  #20018 opened 2026-03-01 20:10 by neilopet (labels: testing, Vulkan, ggml)
vulkan: add sparse OOM fallback for large UMA allocations and chunked staging fallback
  #20017 opened 2026-03-01 20:02 by neilopet (labels: testing, Vulkan, ggml)
feat: add --cache-only flag to skip model re-download
  #20010 opened 2026-03-01 14:43 by lonnie08
server: add Qwen3-Reranker instruction support
  #20009 opened 2026-03-01 14:15 by schwebke (labels: examples, python, server)