Add continuous batching with shared prefix, KV trim, and early exit to OPSD rollout engine

Port the four generation optimizations from the GRPO trainer into
HybridEngineRollout:
1. Shared prefix: prefill the prompt KV cache once, then expand it
   for the n samples
2. Continuous batching: keep a fixed slot count and replace each
   finished slot with the next pending rollout via left-padded KV
   injection
3. KV cache left-trim: trim common leading padding (threshold=16)
4. Early exit / batch compaction: shrink the batch via reorder_cache
   when no pending rollouts remain
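
The slot refill, left-trim, and compaction steps above can be illustrated
with a toy simulation. This is only a sketch under stated assumptions, not
the code in this commit: it works on plain Python lists rather than KV
tensors, and the names run_slots, left_trim, common_left_pad, and PAD are
hypothetical. Whether the real trim fires at exactly 16 shared pad columns
or only above 16 is also an assumption here.

```python
from collections import deque

PAD = 0               # hypothetical pad token id
TRIM_THRESHOLD = 16   # threshold value named in this commit


def common_left_pad(rows, pad=PAD):
    """Count the leading pad columns shared by every row."""
    return min(
        next((i for i, tok in enumerate(row) if tok != pad), len(row))
        for row in rows
    )


def left_trim(rows, threshold=TRIM_THRESHOLD, pad=PAD):
    """Drop shared leading padding once it reaches the threshold.

    Assumption: the trim triggers at >= threshold.
    """
    n = common_left_pad(rows, pad)
    if n < threshold:
        return rows, 0
    return [row[n:] for row in rows], n


def run_slots(lengths, num_slots):
    """Simulate slot scheduling.

    lengths[i] is the number of decode steps rollout i needs.
    Returns (batch size at each step, order in which rollouts finish).
    """
    pending = deque(range(len(lengths)))
    slots = []  # each entry is [rollout_id, remaining_steps]
    while pending and len(slots) < num_slots:
        rid = pending.popleft()
        slots.append([rid, lengths[rid]])

    batch_sizes, finish_order = [], []
    while slots:
        batch_sizes.append(len(slots))
        next_slots = []
        for slot in slots:
            slot[1] -= 1  # one decode step for this slot
            if slot[1] > 0:
                next_slots.append(slot)
                continue
            finish_order.append(slot[0])
            if pending:
                # Continuous batching: reuse the slot for the next rollout.
                rid = pending.popleft()
                next_slots.append([rid, lengths[rid]])
            # Otherwise the slot is dropped, which shrinks the batch
            # (the early exit / compaction case).
        slots = next_slots
    return batch_sizes, finish_order
```

With lengths=[3, 1, 2, 2] and num_slots=2, the batch stays at two rows
while rollouts are pending and shrinks to one row once the queue is empty.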

Activated by setting RolloutConfig.continuous_batching_size > 0.
When it is 0 (the default), the engine falls back to HF generate
(behavior unchanged).
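
A minimal sketch of this switch, assuming a dataclass-style config. Only
the field name continuous_batching_size comes from this commit; the
RolloutConfig shape shown here and the select_generation_path helper are
hypothetical stand-ins for the real dispatch.

```python
from dataclasses import dataclass


@dataclass
class RolloutConfig:
    # 0 (the default) keeps the existing HF generate path.
    continuous_batching_size: int = 0


def select_generation_path(cfg: RolloutConfig) -> str:
    """Hypothetical helper: name the generation path the config selects."""
    if cfg.continuous_batching_size > 0:
        return "continuous_batching"
    return "hf_generate"
```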

Validated on 2xA100-40GB with a Qwen2.5-0.5B student and a 1.5B teacher,
n_samples_per_prompt=32, cb_size=8: ran 5 steps, and the loss stayed finite.

Signed-off-by: Guoyang Ma <gma@dgx-a100-a>