PR #14363 llama : add high-throughput mode


ggerganov merged 17 commits into master from gg/llama-high-throughput

github-actions added the examples label

github-actions added the ggml label

github-actions added the Apple Metal label

ggerganov force pushed to 1b74b9d7 345 days ago

ggerganov force pushed from 61795789 to dfceb012 337 days ago

Base automatically changed from gg/kv-cache-use-set-rows to master 336 days ago

ggerganov force pushed from dfceb012 336 days ago

ggerganov force pushed 336 days ago

ggerganov force pushed to dbcfcaae 336 days ago

compilade commented on 2025-07-03

ggerganov force pushed from dbcfcaae 335 days ago

ggerganov force pushed 335 days ago

ggerganov force pushed to 7b004292 335 days ago

ggerganov force pushed 335 days ago

ggerganov marked this pull request as ready for review 335 days ago

ggerganov force pushed to 5c00eb22 335 days ago

slaren commented on 2025-07-04

ggerganov force pushed 332 days ago

ggerganov force pushed to f23950a6 329 days ago

ggerganov force pushed from f23950a6 to ab82dc20 328 days ago

ggerganov requested a review from JohannesGaessler 328 days ago

ggerganov added the hot label

github-actions added the testing label

github-actions added the Nvidia GPU label

be82648b kv-cache : prepare K/V buffers for separation

5a354755 batched-bench : fix oob write

45ecf841 llama : add "virtual sequences"

4c2d6510 llama : use "stream" vs "virtual sequence"

0d05acd6 graph : fix stream splitting when KV cache is not used

247015ee kv-cache : add multi-stream save/load support

3354ce7e llama : add "--attn-streams" flag

18fb95dd kv-cache : fix handling when find_slot fails

cbe971ae kv-cache : restore find_slot impl

1b4fbc8f kv-cache : add comments

8bf7fec0 kv-cache : add bounds checks for sequence id

91751ead cont : add n_seq_max to batch allocr

2d08a395 kv-cache : perform stream copies lazily after llama_synchronize

69169b15 kv-cache : avoid throwing exceptions across the C boundary

886d3f15 CUDA: 4D FlashAttention support (#14628)

ggerganov force pushed to 886d3f15 327 days ago

fb8150d8 llama : rename attn_streams -> kv_unified

slaren approved these changes on 2025-07-16

318c4f8f common : rename kv_split -> kv_unified
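The commits above split the KV cache into per-sequence "streams", with a `kv_unified` option (renamed from `attn_streams`) that keeps a single shared buffer instead. The following is a minimal sketch of how a sequence id might be mapped to a stream under that model. It assumes one stream per sequence when the cache is not unified and a bounds check on the sequence id. `StreamMap` and `stream_of` are hypothetical names, not llama.cpp's actual kv-cache types or API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: NOT llama.cpp's real kv-cache implementation.
class StreamMap {
public:
    // kv_unified == true : every sequence shares one KV stream (stream 0).
    // kv_unified == false: each sequence id gets its own stream.
    StreamMap(bool kv_unified, uint32_t n_seq_max)
        : n_seq_max_(n_seq_max),
          n_stream_(kv_unified ? 1 : n_seq_max),
          seq_to_stream_(n_seq_max, 0) {
        assert(n_seq_max > 0);
        if (n_stream_ > 1) {
            // One-to-one mapping: sequence s uses stream s.
            for (uint32_t s = 0; s < n_seq_max_; ++s) {
                seq_to_stream_[s] = s;
            }
        }
    }

    uint32_t n_stream() const { return n_stream_; }

    // Bounds-checked lookup. Returns -1 for an invalid id instead of
    // throwing, so the error can be reported across a C API boundary.
    int32_t stream_of(int32_t seq_id) const {
        if (seq_id < 0 || static_cast<uint32_t>(seq_id) >= n_seq_max_) {
            return -1;
        }
        return static_cast<int32_t>(seq_to_stream_[static_cast<size_t>(seq_id)]);
    }

private:
    uint32_t n_seq_max_;                  // maximum number of parallel sequences
    uint32_t n_stream_;                   // 1 if unified, else n_seq_max
    std::vector<uint32_t> seq_to_stream_; // sequence id -> stream index
};
```

For example, `StreamMap(false, 4).stream_of(2)` returns 2, while with `kv_unified` set every valid id maps to stream 0. An out-of-range id yields -1 rather than an exception, which mirrors the intent of the "avoid throwing exceptions across the C boundary" commit without claiming to reproduce it.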

ggerganov merged 225e7a14 into master 323 days ago

ggerganov deleted the gg/llama-high-throughput branch 323 days ago

Reviewers

slaren

compilade

JohannesGaessler

Assignees

No one assigned

Labels

testing Nvidia GPU examples ggml Apple Metal hot

Milestone

No milestone

llama.cpp llama : add high-throughput mode #14363 Merged
