fix(rollout): graph capture CB - allocate sufficient cache for multi-round decode
With continuous batching (CB) + graph capture, decode_pos advances
globally across all rounds, so the cache must cover every round.
E.g. n=8 with cb=4 needs ceil(8/4) = 2 rounds, requiring
prompt_len + ceil(n/cb) * max_new_tokens cache positions.
Previously, max_cache_len = prompt_len + max_new_tokens covered only a
single round, so replacement rounds wrote out of bounds (OOB) and
replaced slots produced EOS/garbage.
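
The sizing rule above can be sketched as follows. This is an
illustration only, not the code in this change: the helper name
required_cache_len and the example values (prompt_len=128,
max_new_tokens=64) are assumptions.

```python
import math


def required_cache_len(prompt_len: int, n: int, cb: int, max_new_tokens: int) -> int:
    """Cache positions needed when decode_pos advances across all rounds.

    Hypothetical helper: n is the number of sequences to generate and
    cb is the number of slots decoded concurrently.
    """
    rounds = math.ceil(n / cb)  # e.g. n=8, cb=4 -> 2 rounds
    return prompt_len + rounds * max_new_tokens


# Old sizing covered one round only; new sizing covers every round.
old_len = 128 + 64
new_len = required_cache_len(prompt_len=128, n=8, cb=4, max_new_tokens=64)
```

With these example values the old sizing gives 192 positions and the
new sizing gives 256, so the second round no longer writes past the
end of the cache.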
Also adds slot_position tracking for correct per-slot RoPE position_ids
when slots are replaced at arbitrary decode steps.
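
A minimal sketch of the per-slot position idea, assuming the cache
write index follows the global decode_pos while RoPE position_ids
follow each slot's own sequence position. The class SlotPositions and
its methods are hypothetical and may not match the actual
implementation.

```python
class SlotPositions:
    """Tracks a global decode_pos plus one logical position per slot."""

    def __init__(self, cb: int, prompt_len: int) -> None:
        self.prompt_len = prompt_len
        self.decode_pos = prompt_len            # global, never reset
        self.slot_position = [prompt_len] * cb  # per-slot, reset on replacement

    def position_ids(self) -> list[int]:
        # Position ids to feed RoPE for the current decode step.
        return list(self.slot_position)

    def step(self) -> None:
        # One decode step: every slot emits a token.
        self.decode_pos += 1
        self.slot_position = [p + 1 for p in self.slot_position]

    def replace(self, slot: int) -> None:
        # A finished slot is refilled with a new sequence that starts
        # right after the prompt, regardless of the global decode_pos.
        self.slot_position[slot] = self.prompt_len


s = SlotPositions(cb=4, prompt_len=10)
s.step()
s.step()
s.replace(1)  # slot 1 finishes and is replaced mid-decode
s.step()
```

After these steps the replaced slot is at position 11 while the others
are at 13, even though decode_pos is 13 for all of them.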