llama.cpp
e4d2e198 - server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold

Commit
10 hours ago
server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold

* estimate with to-be-loaded model size included
* use no_alloc to get memory requirements for model load
* only set model memory_mb if not previously calculated
* use memory margin instead of total size limit, apply to each device separately
* add server memory debug logging
* move llama_context_device_memory function to llama-ext.h
* fix model count exceeded check
* improve memory_per_device map naming
* improve variable naming, fix style
* also strip models memory margin from child processes
* cont : clean-up
* replace device memory map with buft memory map. Use llama_get_memory_breakdown
* extract duplicated check into helper function
* move model memory estimation to subprocess
* precompute name->buft map, map GPU host types to CPU buft
* cleanup unused variable
* remove duplicated init calls
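
The commit message describes checking a per-device memory margin with the to-be-loaded model's size included, and unloading models when the check fails. The following is a minimal C++17 sketch of that idea, not the code from this commit. Every name in it (mem_map, loaded_model, fits, pick_evictions) is hypothetical, the buffer type names are placeholder strings, and the least-recently-used eviction order is an assumption that the commit message does not state.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical: bytes keyed by buffer type name (for example "CPU" or "GPU0").
using mem_map = std::map<std::string, uint64_t>;

// Hypothetical record for a model that is currently loaded.
struct loaded_model {
    std::string name;
    int64_t     last_used; // larger value = used more recently
    mem_map     mem;       // estimated bytes held per buffer type
};

// Returns true if, for every buffer type the incoming model needs,
// at least `margin` bytes would remain free after loading it.
static bool fits(const mem_map & free_mem, const mem_map & need, uint64_t margin) {
    for (const auto & [buft, bytes] : need) {
        const auto it = free_mem.find(buft);
        const uint64_t avail = it == free_mem.end() ? 0 : it->second;
        if (avail < bytes || avail - bytes < margin) {
            return false;
        }
    }
    return true;
}

// Picks models to unload, oldest first (assumed policy), until the incoming
// model fits or no loaded models remain. Works on copies of its inputs.
static std::vector<std::string> pick_evictions(std::vector<loaded_model> loaded,
                                               mem_map free_mem,
                                               const mem_map & incoming,
                                               uint64_t margin) {
    std::sort(loaded.begin(), loaded.end(),
              [](const loaded_model & a, const loaded_model & b) {
                  return a.last_used < b.last_used;
              });

    std::vector<std::string> evict;
    for (const auto & m : loaded) {
        if (fits(free_mem, incoming, margin)) {
            break;
        }
        // Unloading this model returns its estimated memory to each buffer type.
        for (const auto & [buft, bytes] : m.mem) {
            free_mem[buft] += bytes;
        }
        evict.push_back(m.name);
    }
    return evict;
}
```

Because the check runs per buffer type, a model that only needs host memory does not force an unload when only a GPU is short on space. The caller still has to confirm with fits() that the model can be loaded once the returned models are unloaded, since the sketch returns its best effort even when unloading everything is not enough.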