DeepSeek V4 (#24162)
* convert: add dsv4 conversion
* add basic setup
* add llm_graph_input_dsv4
* add save-load state
* add sinkhorn eps - correction by @fairydreaming
* add rope fix
* cleanup dead code
* fix bugs
* support pro model: added by @fairydreaming
* remove redundant V cache
* add chat template
* remove debugging leftovers
* add mechanism for inlining templates based on architecture
* s/deepseek-v4-flash/deepseek4/g
* s/deepseek-v4-flash/deepseek4/g continued
* enable graph reuse
* enable FA
* fix test llama archs
* rename
* compatibility with antirez ds4 GGUFs
* simplify set_gguf_parameters() by calling the superclass method; replace moe.score_func with expert_gating_func
* reserve worst-case kv-cache
* revert max split inputs
* address review comments
* add padding to enable FA
* pad only the final value of plan.n_kv to 256
* remove built-in cpp chat template
* cont: remove cpp built-in template
* remove outdated test
* replace ggml_view_3d() with ggml_reshape_3d()
* only support n_seq=1 for now
* remove unused var
* cont: remove unused var
* use scale bias
* use correct ptr for can_reuse
* remove gen-chat-inline-templates.py
* simplify graph reuse
* cont: cleanup
* remove unused inputs
* enable partial checkpointing
* add correct shape for kq_mask + set llama_model_n_swa to 0 for dsv4
* precompute source_idx + add comment about dummy write
* support multi-seq
* remove restored_trim_pos
* use split_equal when possible
* fix indent
* address review comments
* use LLM_KV
* fix ci
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>