llama.cpp
e4d2e198 - server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold

Commit

10 hours ago

server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold

estimate with to-be-loaded model size included
use no_alloc to get memory requirements for model load
only set model memory_mb if not previously calculated
use memory margin instead of total size limit, apply to each device separately
add server memory debug logging
move llama_context_device_memory function to llama-ext.h
fix model count exceeded check
improve memory_per_device map naming
improve variable naming, fix style
also strip models memory margin from child processes
cont : clean-up
replace device memory map with buft memory map. Use llama_get_memory_breakdown
extract duplicated check into helper function
move model memory estimation to subprocess
precompute name->buft map, map GPU host types to CPU buft
cleanup unused variable
remove duplicated init calls
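The commit message describes the approach only in outline: estimate memory with the to-be-loaded model included, apply a margin to each buffer type separately, and unload models when the margin would be violated. The following is a minimal sketch of that idea, not the code from this commit. All type and function names (`buft_usage`, `loaded_model`, `fits_with_margin`, `plan_unloads`) are hypothetical, and the least-recently-used eviction order is an assumption; the commit message does not say which model is unloaded first.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical bookkeeping types; none of these names come from llama.cpp.
using buft_usage = std::map<std::string, uint64_t>; // buffer type name -> bytes

struct loaded_model {
    std::string name;
    buft_usage  memory;    // estimated bytes per buffer type
    uint64_t    last_used; // larger value = used more recently
};

// Returns true if, for every buffer type the new model needs, the free memory
// covers the model's estimate plus the margin (checked per buffer type).
static bool fits_with_margin(const buft_usage & free_bytes,
                             const buft_usage & needed,
                             uint64_t           margin) {
    for (const auto & kv : needed) {
        const auto it = free_bytes.find(kv.first);
        const uint64_t avail = it == free_bytes.end() ? 0 : it->second;
        if (avail < kv.second + margin) {
            return false;
        }
    }
    return true;
}

// Chooses models to unload, least recently used first (an assumed policy),
// until the new model fits. Sets ok to false if it does not fit even after
// every loaded model has been selected for unloading.
static std::vector<std::string> plan_unloads(std::vector<loaded_model> models,
                                             buft_usage                free_bytes,
                                             const buft_usage &        needed,
                                             uint64_t                  margin,
                                             bool &                    ok) {
    std::sort(models.begin(), models.end(),
              [](const loaded_model & a, const loaded_model & b) {
                  return a.last_used < b.last_used;
              });

    std::vector<std::string> to_unload;
    for (size_t i = 0; i < models.size() && !fits_with_margin(free_bytes, needed, margin); ++i) {
        // Unloading a model returns its estimated memory to each buffer type.
        for (const auto & kv : models[i].memory) {
            free_bytes[kv.first] += kv.second;
        }
        to_unload.push_back(models[i].name);
    }

    ok = fits_with_margin(free_bytes, needed, margin);
    return to_unload;
}
```

Checking each buffer type on its own, rather than summing across them, mirrors the "apply to each device separately" item: a model that fits in aggregate can still overflow a single device.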

References

0cc4m/server-memory-limit

#21231 - server: add router device memory margin parameter for dynamic unloading

Author

0cc4m

Committer

0cc4m

Parents

277a105d
