llama.cpp
ggml, llama : add KV cache size limiting and block tracking infrastructure
#18747
Open


pestopoppa wants to merge 17 commits into ggml-org:master from pestopoppa:feature/paged-attention
pestopoppa
pestopoppa feat: add --moe-n-expert flag for MoE expert count override (Hard Mask)
553b6dce
pestopoppa feat: add layer skip / early exit support for speculative decoding
b5e11afb
pestopoppa feat: add layer skip support for qwen3vl-moe and qwen3next
42e7d627
pestopoppa lookahead: fix n_seq_max and kv_unified configuration
7bf427dc
pestopoppa lookup, lookahead: fix crash when n_ctx not specified
2a16c438
pestopoppa ggml-cpu: parallelize tensor repacking with OpenMP
2ee7aa7e
pestopoppa docs: add branch management rules to prevent build issues
e3053631
pestopoppa kv-cache : optimize SWA slot reuse with forward-looking masking
394e0cb3
pestopoppa kv-cache: fix SWA cell reuse to ensure mathematical correctness
6b43356a
pestopoppa feat: implement CPU paged attention for flash attention
de4f93c9
pestopoppa feat: implement dynamic block allocation for paged attention
c0ca18b7
pestopoppa feat: add block pool statistics for debugging paged attention
eb40d730
pestopoppa feat: add KV cache memory reduction for paged attention
b14fe3bf
pestopoppa test: add unit tests for block pool and table
e14387ae
pestopoppa feat: add CLI flags for paged attention
9db451ee
pestopoppa refactor: trim verbose comments in llama-kv-block.h
0b633c35
pestopoppa requested a review from ggerganov 2 days ago
pestopoppa requested a review from CISC 2 days ago
pestopoppa requested a review from JohannesGaessler 2 days ago
github-actions added the model, testing, examples, and ggml labels
pestopoppa changed the title from "ggml, llama : add CPU paged attention for memory-efficient KV cache" to "ggml, llama : add KV cache size limiting and block tracking infrastructure" 1 day ago
pestopoppa refactor: remove unrelated changes from KV cache PR
d98013d1
