llama.cpp
llama : add high-throughput mode
#14363
Merged
ggerganov merged 17 commits into master from gg/llama-high-throughput
github-actions added labels: examples, ggml, Apple Metal
ggerganov force-pushed to 1b74b9d7 (174 days ago)
ggerganov force-pushed from 61795789 to dfceb012 (166 days ago)
Base automatically changed from gg/kv-cache-use-set-rows to master (165 days ago)
ggerganov force-pushed 5 times, from dfceb012 to dbcfcaae (165 days ago)
compilade commented on 2025-07-03
ggerganov force-pushed 5 times, from dbcfcaae to 7b004292 (164 days ago)
ggerganov marked this pull request as ready for review (164 days ago)
ggerganov force-pushed to 5c00eb22 (164 days ago)
slaren commented on 2025-07-04
ggerganov force-pushed 3 times (161 days ago)
ggerganov force-pushed to f23950a6 (158 days ago)
ggerganov force-pushed from f23950a6 to ab82dc20 (157 days ago)
ggerganov requested a review from JohannesGaessler (157 days ago)
ggerganov added label: hot
github-actions added labels: testing, Nvidia GPU
Commits:
be82648b  kv-cache : prepare K/V buffers for separation
5a354755  batched-bench : fix oob write
45ecf841  llama : add "virtual sequences"
4c2d6510  llama : use "stream" vs "virtual sequence"
0d05acd6  graph : fix stream splitting when KV cache is not used
247015ee  kv-cache : add multi-stream save/load support
3354ce7e  llama : add "--attn-streams" flag
18fb95dd  kv-cache : fix handling when find_slot fails
cbe971ae  kv-cache : restore find_slot impl
1b4fbc8f  kv-cache : add comments
8bf7fec0  kv-cache : add bounds checks for sequence id
91751ead  cont : add n_seq_max to batch allocr
2d08a395  kv-cache : perform stream copies lazily after llama_synchronize
69169b15  kv-cache : avoid throwing exceptions across the C boundary
886d3f15  CUDA: 4D FlashAttention support (#14628)
ggerganov force-pushed to 886d3f15 (156 days ago)
fb8150d8  llama : rename attn_streams -> kv_unified
slaren approved these changes on 2025-07-16
318c4f8f  common : rename kv_split -> kv_unified
ggerganov merged 225e7a14 into master (152 days ago)
ggerganov deleted the gg/llama-high-throughput branch (152 days ago)
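The commits above center on splitting the KV cache into per-sequence "streams" instead of one unified buffer shared by all parallel sequences. As a rough, hypothetical illustration of the sizing trade-off (our own arithmetic; the function, shapes, and numbers below are illustrative and not taken from the PR), splitting a fixed context budget across n_seq streams keeps the total KV-cache footprint unchanged, while each stream's attention only ever scans its own slice of cells:

```python
# Hypothetical KV-cache sizing sketch: unified buffer vs per-sequence streams.
# All dimensions are illustrative defaults, not values from the PR.

def kv_cache_bytes(n_ctx, n_layer, n_head_kv, head_dim, bytes_per_elem=2):
    # K and V each hold n_ctx * n_layer * n_head_kv * head_dim elements (fp16 here)
    return 2 * n_ctx * n_layer * n_head_kv * head_dim * bytes_per_elem

n_seq = 4     # parallel sequences
n_ctx = 8192  # total context budget (cells), shared or split

# Unified: one buffer of n_ctx cells, shared by all sequences.
unified = kv_cache_bytes(n_ctx, n_layer=32, n_head_kv=8, head_dim=128)

# Streams: n_seq buffers of n_ctx // n_seq cells each.
# Same total memory, but each stream's attention is over its own slice only.
streams = n_seq * kv_cache_bytes(n_ctx // n_seq, n_layer=32, n_head_kv=8, head_dim=128)

assert unified == streams
print(unified // (1024 * 1024), "MiB")  # prints: 1024 MiB
```

The throughput gain in this mode comes not from memory savings but from the per-stream layout: attention for each sequence can run over a contiguous, private K/V region, which is what the "4D FlashAttention" and stream-splitting commits enable on the backend side.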
Reviewers: slaren, compilade, JohannesGaessler
Assignees: no one assigned
Labels: testing, Nvidia GPU, examples, ggml, Apple Metal, hot
Milestone: no milestone