refactor(rollout): graph capture CB uses round-based reset instead of oversized cache

Replace immediate slot replacement with round-based batching:
- When all batch slots finish, reset decode_pos to prompt_len
- Refill batch with next rollouts from fresh prefill KV
- max_cache_len = prompt_len + max_new_tokens (no longer scaled by num_rounds)
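
A minimal sketch of the round-based reset, not the actual rollout code. It simulates per-slot decode positions in plain Python; `round_based_decode` and `gen_lens` are hypothetical names, and it assumes a per-slot `decode_pos` and that every rollout emits at most `max_new_tokens` tokens.

```python
def round_based_decode(gen_lens, cb, prompt_len, max_new_tokens):
    """Simulate round-based refills; gen_lens[i] is how many tokens rollout i emits."""
    max_cache_len = prompt_len + max_new_tokens  # not scaled by the number of rounds
    peak_pos, num_rounds = prompt_len, 0
    for start in range(0, len(gen_lens), cb):
        # New round: every slot restarts right after the prompt prefix.
        decode_pos = [prompt_len] * cb
        remaining = list(gen_lens[start:start + cb])
        remaining += [0] * (cb - len(remaining))  # pad a short final round
        # The round ends only when every slot has finished.
        while any(remaining):
            for slot in range(cb):
                if remaining[slot] > 0:
                    decode_pos[slot] += 1
                    remaining[slot] -= 1
        peak_pos = max(peak_pos, max(decode_pos))
        num_rounds += 1
    assert peak_pos <= max_cache_len  # one round always fits in the cache
    return max_cache_len, peak_pos, num_rounds


cache_len, peak, rounds = round_based_decode(
    gen_lens=[5, 8, 3, 8, 2, 7, 8, 1], cb=4, prompt_len=16, max_new_tokens=8)
print(cache_len, peak, rounds)  # 24 24 2
```

If the old sizing was `num_rounds * (prompt_len + max_new_tokens)` (an assumption; the previous formula is not stated here), the same run would have reserved 48 positions per slot instead of 24.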

For n=8, cb=4 (two rounds of four slots) this halves KV memory while
keeping a 2.17x speedup over non-graph CB (343 vs 159 tok/s). Tradeoff:
finished slots sit idle until the slowest slot in the round completes,
since there is no immediate replacement.
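
The idle cost can be estimated with a rough sketch like the one below. `idle_slot_steps` is a hypothetical helper, the generation lengths are made up for illustration, and each decode step is assumed to advance every active slot by one token.

```python
def idle_slot_steps(gen_lens, cb):
    """Count decode steps where a finished slot waits for the round's slowest slot."""
    idle = 0
    for start in range(0, len(gen_lens), cb):
        batch = gen_lens[start:start + cb]
        round_steps = max(batch)  # a round lasts as long as its slowest rollout
        # Slots left empty in a short final round are not counted here.
        idle += sum(round_steps - n for n in batch)
    return idle


lens = [5, 8, 3, 8, 2, 7, 8, 1]  # illustrative lengths, not measured data
idle = idle_slot_steps(lens, cb=4)
useful = sum(lens)
print(idle, useful)  # 22 42
```

With immediate replacement those 22 slot-steps could have gone to the next rollout. Round-based reset gives them up in exchange for the smaller cache.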