server: router fix model unload reload deadlock #22284
server: add --models-memory-max parameter to allow dynamically unload…
8e8e2007
estimate with to-be-loaded model size included
777395f6
use no_alloc to get memory requirements for model load
2603b4c5
only set model memory_mb if not previously calculated
9b5af58a
use memory margin instead of total size limit, apply to each device s…
56122b35
add server memory debug logging
51538c1f
move llama_context_device_memory function to llama-ext.h
ba2521c6
fix model count exceeded check
75000630
improve memory_per_device map naming
173da43c
improve variable naming, fix style
69e30861
also strip models memory margin from child processes
eb2cf73f
cont : clean-up
1a8aec0a
handle models that need to be downloaded before estimation
b1623a61
load directly from downloaded state
cf0ebc4e
server: keep router model refcount to avoid unloading models that hav…
a5355a02
Assignees
No one assigned
Labels
examples
python
server