llama.cpp
server: add router device memory margin parameter for dynamic unloading
#21231

Open

0cc4m wants to merge 20 commits into master from 0cc4m/server-memory-limit

0cc4m requested a review 72 days ago

ngxson commented on 2026-03-31

github-actions added examples

github-actions added server

0cc4m requested a review from ggerganov 71 days ago

0cc4m changed the title ~~server: add router max memory parameter for dynamic unloading~~ server: add router device memory margin parameter for dynamic unloading 71 days ago

ggerganov commented on 2026-04-02

0cc4m force pushed from 4312ed2a to 1d4a5f93 70 days ago

ggerganov commented on 2026-04-03

0cc4m requested a review from ggerganov 62 days ago

0cc4m requested a review from ngxson 62 days ago

ggerganov assigned ggerganov 62 days ago

0cc4m force pushed from 0124ec9e to 3c53be14 60 days ago

ggerganov commented on 2026-04-16

0cc4m force pushed from 61c25687 to cf0ebc4e 51 days ago

0cc4m force pushed from cf0ebc4e to da1f1688 41 days ago

ggerganov commented on 2026-05-04

ngxson commented on 2026-05-13

0cc4m force pushed from da1f1688 to d65d956b 28 days ago

0cc4m force pushed from d65d956b to 0bb8e548 28 days ago

0cc4m force pushed from 5fa97b12 to 6adf9643 22 days ago

danbev commented on 2026-05-21

0cc4m force pushed from 6adf9643 to 82403fdc 13 days ago

0cc4m force pushed from 82403fdc to 645d17ea 4 days ago

server: add --models-memory-max parameter to allow dynamically unload…

34a9a7e5

estimate with to-be-loaded model size included

716cd77e

use no_alloc to get memory requirements for model load

d6dac7e9

only set model memory_mb if not previously calculated

40f8b387

use memory margin instead of total size limit, apply to each device s…

fdca28e9

add server memory debug logging

3fe090f2

move llama_context_device_memory function to llama-ext.h

7a266473

fix model count exceeded check

91b0d08c

improve memory_per_device map naming

8973faab

improve variable naming, fix style

fdfda6b5

also strip models memory margin from child processes

bdd79f03

cont : clean-up

a45085e9

handle models that need to be downloaded before estimation

b4c56304

load directly from downloaded state

ffd27c69

replace device memory map with buft memory map. Use llama_get_memory_…

efb55e71

extract duplicated check into helper function

7d58d31c

move model memory estimation to subprocess

44139bc7

precompute name->buft map, map GPU host types to CPU buft

409120f5

cleanup unused variable

ca54eda6

remove duplicated init calls

37a767f5

0cc4m force pushed from 645d17ea to 37a767f5 23 hours ago

Reviewers

danbev

ngxson

ggerganov

Assignees

ggerganov

Labels

examples, server

Milestone

No milestone
