llama.cpp
4e732e0a - llama: allow partial seq_rm for GDN models for speculative decoding

Commit
1 day ago
llama: allow partial seq_rm for GDN models for speculative decoding. Currently, speculative decoding has to restart from a checkpoint whenever some draft tokens are rejected, which wastes work re-running the target model. This PR adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.