llama.cpp
cd5e3b57 - server : support unified cache across slots (#16736)

Commit

194 days ago

server : support unified cache across slots (#16736)

* server : support unified context across slots
* cont : fix speculative decoding initialization
* context : fix n_ctx_per_seq computation
* server : purge slots one by one
* tests : add unified cache server tests
* llama : update per-seq context computation
* test-thread-safety : handle tiny training context of the input model
* server : fix server_tokens clear()
* server : use 4 slots + unified KV by default
* llama : add note about context size queries
* cont : update todos [no ci]
* context : do not cap the size of the context
* tests : adjust parameters to be CI friendlier
* context : add warning

References

#16736 - server : support unified cache across slots

Author

ggerganov

Parents

87c9efc3
