DeepSpeed
9b4d823e - refactor(rollout): graph capture CB uses round-based reset instead of oversized cache

Commit
26 days ago
Replace immediate slot replacement with round-based batching:

- When all batch slots finish, reset decode_pos to prompt_len
- Refill batch with next rollouts from fresh prefill KV
- max_cache_len = prompt_len + max_new_tokens (no longer scaled by num_rounds)

This halves KV memory for n=8 cb=4 while maintaining 2.17x speedup over non-graph CB (343 vs 159 tok/s).

Tradeoff: slots idle while waiting for slowest slot in the round to finish (no immediate replacement).
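The round-based scheme described above can be illustrated with a small, framework-free simulation. This is only a sketch under stated assumptions, not the code from this commit: `run_rounds`, `gen_lens`, and `batch_size` are hypothetical names, each rollout is reduced to the number of tokens it generates, and prefill and graph replay are not modeled. It shows that resetting decode_pos to prompt_len at the start of every round keeps each slot within prompt_len + max_new_tokens, and it counts the slot-steps spent idle while a round waits for its slowest slot.

```python
def run_rounds(gen_lens, batch_size, prompt_len, max_new_tokens):
    """Simulate round-based batching (hypothetical helper, not the commit's code).

    gen_lens: tokens each rollout generates before finishing (assumed known here).
    Returns (rounds, decode_steps, idle_slot_steps).
    """
    # Cache length per slot no longer depends on the number of rounds.
    max_cache_len = prompt_len + max_new_tokens
    pending = list(gen_lens)
    rounds = decode_steps = idle_slot_steps = 0

    while pending:
        # Refill the whole batch at once with the next rollouts.
        batch, pending = pending[:batch_size], pending[batch_size:]
        # Every slot restarts right after the prompt (fresh prefill KV assumed).
        decode_pos = [prompt_len] * len(batch)
        remaining = list(batch)

        # The round ends only when the slowest slot has finished.
        while any(r > 0 for r in remaining):
            for i, r in enumerate(remaining):
                if r > 0:
                    decode_pos[i] += 1
                    remaining[i] -= 1
                    assert decode_pos[i] <= max_cache_len
                else:
                    idle_slot_steps += 1  # finished slot waits for the round
            decode_steps += 1
        rounds += 1

    return rounds, decode_steps, idle_slot_steps


# Example with n=8 rollouts and 4 batch slots (made-up generation lengths).
rounds, steps, idle = run_rounds(
    gen_lens=[3, 5, 2, 4, 6, 1, 1, 2],
    batch_size=4,
    prompt_len=10,
    max_new_tokens=6,
)
print(rounds, steps, idle)
```

With these made-up lengths the eight rollouts complete in two rounds, and the idle count makes the stated tradeoff concrete: short rollouts leave their slots unused until the longest rollout in the same round finishes.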