DeepSpeed
bfac5063 - fix(rollout): graph capture CB - allocate sufficient cache for multi-round decode

Commit
36 days ago
With continuous batching (CB) and graph capture, decode_pos advances globally across all rounds. For example, n=8 with cb=4 needs 2 rounds, so the cache must hold prompt_len + ceil(n/cb) * max_new_tokens positions.

Previously, max_cache_len = prompt_len + max_new_tokens, which caused out-of-bounds (OOB) writes on replacement rounds and produced EOS/garbage output for replaced slots.

Also adds slot_position tracking so that each slot gets correct RoPE position_ids when slots are replaced at arbitrary decode steps.
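The sizing arithmetic in the message can be checked with a small sketch. It is not the code from this commit: the function names are hypothetical, prompt_len=16 and max_new_tokens=4 are made-up values, and it assumes decode_pos starts at prompt_len and advances by one per decode step, with each round taking up to max_new_tokens steps.

```python
import math


def old_max_cache_len(prompt_len: int, max_new_tokens: int) -> int:
    """Previous sizing: room for a single round of decoding only."""
    return prompt_len + max_new_tokens


def new_max_cache_len(prompt_len: int, max_new_tokens: int, n: int, cb: int) -> int:
    """Sizing from the commit message: one max_new_tokens span per round."""
    rounds = math.ceil(n / cb)  # n samples served cb at a time
    return prompt_len + rounds * max_new_tokens


def last_decode_pos(prompt_len: int, max_new_tokens: int, n: int, cb: int) -> int:
    """Highest cache index written, under the assumption stated above
    (decode_pos starts at prompt_len and never resets between rounds)."""
    return prompt_len + math.ceil(n / cb) * max_new_tokens - 1


# Illustrative values only; n=8 and cb=4 are taken from the commit message.
prompt_len, max_new_tokens, n, cb = 16, 4, 8, 4

old_len = old_max_cache_len(prompt_len, max_new_tokens)
new_len = new_max_cache_len(prompt_len, max_new_tokens, n, cb)
last_pos = last_decode_pos(prompt_len, max_new_tokens, n, cb)

print(f"old cache len: {old_len}, new cache len: {new_len}, last write index: {last_pos}")
print(f"old sizing overflows: {last_pos >= old_len}")
print(f"new sizing overflows: {last_pos >= new_len}")
```

With these values the second round writes past the end of the old cache, which matches the OOB behaviour described for replacement rounds, while the new sizing leaves every write in bounds.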