llama.cpp
server: add router device memory margin parameter for dynamic unloading
#21231
Open

0cc4m wants to merge 20 commits into master from 0cc4m/server-memory-limit
0cc4m requested a review 72 days ago
0cc4m requested a review 72 days ago
ngxson commented on 2026-03-31
ngxson commented on 2026-03-31
github-actions added the examples and server labels
0cc4m requested a review from ggerganov 71 days ago
ggerganov
0cc4m
0cc4m changed the title from "server: add router max memory parameter for dynamic unloading" to "server: add router device memory margin parameter for dynamic unloading" 71 days ago
0cc4m
ggerganov commented on 2026-04-02
0cc4m force pushed from 4312ed2a to 1d4a5f93 70 days ago
ggerganov commented on 2026-04-03
0cc4m requested a review from ggerganov 62 days ago
0cc4m requested a review from ngxson 62 days ago
ggerganov self-assigned this 62 days ago
0cc4m force pushed from 0124ec9e to 3c53be14 60 days ago
ggerganov commented on 2026-04-16
0cc4m force pushed from 61c25687 to cf0ebc4e 51 days ago
0cc4m force pushed from cf0ebc4e to da1f1688 41 days ago
MGAndreasen
0cc4m
ggerganov commented on 2026-05-04
ggerganov commented on 2026-05-04
ggerganov commented on 2026-05-04
ngxson commented on 2026-05-13
0cc4m force pushed from da1f1688 to d65d956b 28 days ago
0cc4m force pushed from d65d956b to 0bb8e548 28 days ago
0cc4m
ggerganov
0cc4m force pushed from 5fa97b12 to 6adf9643 22 days ago
0cc4m
danbev commented on 2026-05-21
ServeurpersoCom
0cc4m
ServeurpersoCom
0cc4m
ServeurpersoCom
ORippler
0cc4m force pushed from 6adf9643 to 82403fdc 13 days ago
0cc4m force pushed from 82403fdc to 645d17ea 4 days ago
0cc4m
34a9a7e5 (0cc4m) server: add --models-memory-max parameter to allow dynamically unload…
716cd77e (0cc4m) estimate with to-be-loaded model size included
d6dac7e9 (0cc4m) use no_alloc to get memory requirements for model load
40f8b387 (0cc4m) only set model memory_mb if not previously calculated
fdca28e9 (0cc4m) use memory margin instead of total size limit, apply to each device s…
3fe090f2 (0cc4m) add server memory debug logging
7a266473 (0cc4m) move llama_context_device_memory function to llama-ext.h
91b0d08c (0cc4m) fix model count exceeded check
8973faab (0cc4m) improve memory_per_device map naming
fdfda6b5 (0cc4m) improve variable naming, fix style
bdd79f03 (0cc4m) also strip models memory margin from child processes
a45085e9 (ggerganov) cont : clean-up
b4c56304 (0cc4m) handle models that need to be downloaded before estimation
ffd27c69 (0cc4m) load directly from downloaded state
efb55e71 (0cc4m) replace device memory map with buft memory map. Use llama_get_memory_…
7d58d31c (0cc4m) extract duplicated check into helper function
44139bc7 (0cc4m) move model memory estimation to subprocess
409120f5 (0cc4m) precompute name->buft map, map GPU host types to CPU buft
ca54eda6 (0cc4m) cleanup unused variable
37a767f5 (0cc4m) remove duplicated init calls
0cc4m force pushed from 645d17ea to 37a767f5 23 hours ago
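The commit messages outline the mechanism: estimate a model's memory requirements before loading (d6dac7e9 uses `no_alloc`, 44139bc7 moves the estimate into a subprocess), add that estimate to what is already loaded (716cd77e), and check a margin on each device rather than a total size limit (fdca28e9), unloading models when the check fails. The C++17 sketch below illustrates such a per-device check. It is not the PR's code, and it rests on several assumptions. It takes the per-device estimates as given inputs. It reads the margin as the minimum headroom every device must keep after the load. It evicts models in least-recently-used order. The names `DeviceMemory`, `LoadedModel`, `fits_with_margin`, and `make_room` are hypothetical.

```cpp
// C++17 sketch. All names below are hypothetical, not llama.cpp's server code.
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Bytes per device (or buffer type), keyed by a name such as "GPU0" or "CPU".
using DeviceMemory = std::map<std::string, uint64_t>;

struct LoadedModel {
    std::string  name;
    DeviceMemory usage;     // estimated bytes this model occupies on each device
    uint64_t     last_used; // logical timestamp; LRU eviction is an assumption
};

// True if, after adding `incoming`, every device in `total` keeps at least
// `margin` bytes unused. Usage on devices absent from `total` is ignored.
bool fits_with_margin(const DeviceMemory & total,
                      const std::vector<LoadedModel> & loaded,
                      const DeviceMemory & incoming,
                      uint64_t margin) {
    for (const auto & [dev, capacity] : total) {
        uint64_t used = 0;
        for (const auto & m : loaded) {
            auto it = m.usage.find(dev);
            if (it != m.usage.end()) {
                used += it->second;
            }
        }
        // Include the to-be-loaded model in the estimate.
        auto it = incoming.find(dev);
        if (it != incoming.end()) {
            used += it->second;
        }
        if (used + margin > capacity) {
            return false;
        }
    }
    return true;
}

// Unloads least-recently-used models until `incoming` fits or nothing is left
// to unload. Returns the names of the unloaded models, oldest first.
std::vector<std::string> make_room(const DeviceMemory & total,
                                   std::vector<LoadedModel> & loaded,
                                   const DeviceMemory & incoming,
                                   uint64_t margin) {
    std::sort(loaded.begin(), loaded.end(),
              [](const LoadedModel & a, const LoadedModel & b) { return a.last_used < b.last_used; });
    std::vector<std::string> unloaded;
    while (!loaded.empty() && !fits_with_margin(total, loaded, incoming, margin)) {
        unloaded.push_back(loaded.front().name);
        loaded.erase(loaded.begin());
    }
    return unloaded;
}
```

`make_room` can return with every model unloaded and the check still failing, for example when the incoming model alone exceeds a device's capacity minus the margin. A caller would therefore re-check `fits_with_margin` afterwards and, presumably, refuse the load in that case.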
