llama.cpp
ggml, llama : add KV cache size limiting and block tracking infrastructure
#18747
Open


pestopoppa wants to merge 17 commits into ggml-org:master from pestopoppa:feature/paged-attention
pestopoppa
pestopoppa feat: add --moe-n-expert flag for MoE expert count override (Hard Mask)
553b6dce
pestopoppa feat: add layer skip / early exit support for speculative decoding
b5e11afb
pestopoppa feat: add layer skip support for qwen3vl-moe and qwen3next
42e7d627
pestopoppa lookahead: fix n_seq_max and kv_unified configuration
7bf427dc
pestopoppa lookup, lookahead: fix crash when n_ctx not specified
2a16c438
pestopoppa ggml-cpu: parallelize tensor repacking with OpenMP
2ee7aa7e
pestopoppa docs: add branch management rules to prevent build issues
e3053631
pestopoppa kv-cache : optimize SWA slot reuse with forward-looking masking
394e0cb3
pestopoppa kv-cache: fix SWA cell reuse to ensure mathematical correctness
6b43356a
pestopoppa feat: implement CPU paged attention for flash attention
de4f93c9
pestopoppa feat: implement dynamic block allocation for paged attention
c0ca18b7
pestopoppa feat: add block pool statistics for debugging paged attention
eb40d730
pestopoppa feat: add KV cache memory reduction for paged attention
b14fe3bf
pestopoppa test: add unit tests for block pool and table
e14387ae
pestopoppa feat: add CLI flags for paged attention
9db451ee
pestopoppa refactor: trim verbose comments in llama-kv-block.h
0b633c35
pestopoppa requested a review from ggerganov 2 days ago
pestopoppa requested a review from CISC 2 days ago
pestopoppa requested a review from JohannesGaessler 2 days ago
github-actions added the model, testing, examples, and ggml labels
pestopoppa changed the title from "ggml, llama : add CPU paged attention for memory-efficient KV cache" to "ggml, llama : add KV cache size limiting and block tracking infrastructure" 1 day ago
pestopoppa refactor: remove unrelated changes from KV cache PR
d98013d1
