llama.cpp
4e732e0a - llama: allow partial seq_rm for GDN models for speculative decoding

Commit
1 day ago
llama: allow partial seq_rm for GDN models for speculative decoding. Currently, speculative decoding has to restart from a checkpoint whenever some draft tokens are rejected, which wastes work re-running the target model. This PR adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.