Add OPSD vLLM rollout scaffold, Qwen2/Qwen3 weight bridges, and README
Lands the second-stage rollout path, weight-sync infrastructure, and the
example app's README. Includes:
* VLLMRollout that constructs vllm.LLM on training rank 0 and broadcasts
generated token ids to peer ranks, with disjoint-GPU (subprocess) and
shared (in-process) topology paths. Weight sync gathers ZeRO-3 params
cooperatively, then pushes them to vLLM via
LLM.collective_rpc("load_weights").
* WeightBridge ABC with COLUMN / ROW / VOCAB / REPLICATED parallel kinds
and an even-slice per-rank slicer; Qwen2WeightBridge with the full
per-parameter table for Qwen2 / Qwen2.5; Qwen3WeightBridge adding the
per-head q_norm / k_norm tensors as REPLICATED.
* vLLM-side prompt+response stitching factored into stitch_rollout() so
its index math is unit-testable without a live vLLM.
* CPU-only tests: tests/test_weight_bridge.py covers parallel-kind
dispatch, per-rank shape/gather round-trips across tp_size in {1,2,4},
indivisibility / invalid-rank guards, and the registry;
tests/test_vllm_stitch.py covers prompt/response stitching for the
common shapes including variable response lengths and left-padded
prompts.
* Configs and launch scripts for both production and smoke vLLM runs.
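
For reviewers, the even-slice behaviour described in the WeightBridge bullet can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `even_slice`, `gather`, and the plain nested-list weight are hypothetical stand-ins, and the dimension choice (COLUMN and VOCAB shard dim 0, ROW shards dim 1 of an `[out, in]` weight) follows the usual tensor-parallel convention rather than anything quoted from the bridge itself.

```python
from enum import Enum
from typing import List

# Hypothetical stand-in for a 2-D weight laid out as [out_features][in_features].
Matrix = List[List[float]]


class ParallelKind(Enum):
    COLUMN = "column"          # shard the output dim (dim 0)
    ROW = "row"                # shard the input dim (dim 1)
    VOCAB = "vocab"            # shard the vocabulary dim (dim 0)
    REPLICATED = "replicated"  # every rank keeps the full tensor


def even_slice(weight: Matrix, kind: ParallelKind, rank: int, tp_size: int) -> Matrix:
    """Return the shard of `weight` owned by `rank` out of `tp_size` ranks."""
    if tp_size < 1 or not 0 <= rank < tp_size:
        raise ValueError(f"invalid rank {rank} for tp_size {tp_size}")
    if kind is ParallelKind.REPLICATED:
        return [row[:] for row in weight]
    dim = 1 if kind is ParallelKind.ROW else 0
    size = len(weight) if dim == 0 else len(weight[0])
    if size % tp_size != 0:
        raise ValueError(f"dim {dim} of size {size} not divisible by {tp_size}")
    step = size // tp_size
    lo, hi = rank * step, (rank + 1) * step
    if dim == 0:
        return [row[:] for row in weight[lo:hi]]
    return [row[lo:hi] for row in weight]


def gather(shards: List[Matrix], kind: ParallelKind) -> Matrix:
    """Inverse of even_slice: rebuild the full weight from per-rank shards."""
    if kind is ParallelKind.REPLICATED:
        return [row[:] for row in shards[0]]
    if kind is ParallelKind.ROW:
        # Concatenate along dim 1, row by row.
        return [sum((s[i] for s in shards), []) for i in range(len(shards[0]))]
    # COLUMN and VOCAB: concatenate along dim 0.
    return [row[:] for s in shards for row in s]


# Round-trip check in the spirit of the tp_size in {1,2,4} tests.
full = [[float(r * 8 + c) for c in range(8)] for r in range(4)]
for tp in (1, 2, 4):
    for kind in ParallelKind:
        parts = [even_slice(full, kind, r, tp) for r in range(tp)]
        assert gather(parts, kind) == full
```

The two `ValueError` branches correspond to the invalid-rank and indivisibility guards mentioned in the test bullet.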
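
The stitching bullet is easier to review with a concrete picture of the index math. The sketch below is an assumption about the layout, not the signature of the real stitch_rollout(): `stitch_rollout_sketch` is a hypothetical name, prompts are assumed left-padded to a common length, responses are right-padded to the longest response, and `pad_id` is assumed never to appear as a real prompt token.

```python
from typing import List, Tuple

Batch = List[List[int]]


def stitch_rollout_sketch(
    prompts: Batch, responses: Batch, pad_id: int
) -> Tuple[Batch, Batch, Batch]:
    """Join left-padded prompts with variable-length responses.

    Returns (input_ids, attention_mask, response_mask), each of shape
    [batch][prompt_len + max_response_len].
    """
    if len(prompts) != len(responses):
        raise ValueError("prompts and responses must have the same batch size")
    prompt_len = len(prompts[0]) if prompts else 0
    if any(len(p) != prompt_len for p in prompts):
        raise ValueError("prompts must already be padded to a common length")
    max_resp = max((len(r) for r in responses), default=0)

    input_ids: Batch = []
    attention_mask: Batch = []
    response_mask: Batch = []
    for prompt, resp in zip(prompts, responses):
        pad = max_resp - len(resp)  # right padding for short responses
        input_ids.append(list(prompt) + list(resp) + [pad_id] * pad)
        # Assumption: pad_id marks padding only, so it masks the left pad.
        attention_mask.append(
            [int(tok != pad_id) for tok in prompt] + [1] * len(resp) + [0] * pad
        )
        # Response tokens start at index prompt_len for every row.
        response_mask.append([0] * prompt_len + [1] * len(resp) + [0] * pad)
    return input_ids, attention_mask, response_mask


ids, attn, resp_mask = stitch_rollout_sketch(
    prompts=[[0, 0, 5, 6], [7, 8, 9, 10]],
    responses=[[11], [12, 13, 14]],
    pad_id=0,
)
assert ids[0] == [0, 0, 5, 6, 11, 0, 0]
assert resp_mask[1] == [0, 0, 0, 0, 1, 1, 1]
```

Because the helper takes plain token-id lists, checks like the two assertions above run on CPU with no vLLM import, which is the property the bullet is after.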
**Known blocker called out in README and module docstring:** vLLM's worker
init calls new_group() on the global process group, which deadlocks when
launched under the standard `deepspeed --num_gpus N` launcher (only rank 0
constructs vLLM, so the other ranks never enter vLLM's collective). The
documented fix is the TRL/OpenRLHF separate-server pattern; this PR lands
the scaffolding so that work can begin against a green codebase.
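
The shape of the blocker can be reproduced without GPUs. The sketch below is only an analogy: it uses `threading.Barrier` in place of a torch.distributed collective, `try_collective` is a hypothetical helper, and a timeout stands in for what would be an indefinite hang in the real launch.

```python
import threading
from typing import List, Tuple


def try_collective(
    expected: int, arriving: int, timeout: float = 0.5
) -> List[Tuple[int, str]]:
    """Simulate a collective that `expected` ranks must enter.

    Only `arriving` ranks actually call it. Returns (rank, outcome)
    pairs, where outcome is "done" or "stuck".
    """
    barrier = threading.Barrier(expected)
    results: List[Tuple[int, str]] = []
    lock = threading.Lock()

    def worker(rank: int) -> None:
        try:
            barrier.wait(timeout=timeout)  # stands in for the collective call
            outcome = "done"
        except threading.BrokenBarrierError:
            outcome = "stuck"  # a real collective would block here forever
        with lock:
            results.append((rank, outcome))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(arriving)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


# Every rank enters the collective: it completes.
assert try_collective(expected=2, arriving=2) == [(0, "done"), (1, "done")]
# Only rank 0 enters (the situation under the deepspeed launcher): it never completes.
assert try_collective(expected=2, arriving=1) == [(0, "stuck")]
```

A separate-server layout sidesteps this by keeping vLLM's process group out of the training job's global group, so no training rank is ever expected in vLLM's collectives.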
Signed-off-by: Zhipeng Wang <zhipengbayern@gmail.com>