DeepSpeed
6384396b - Add OPSD vLLM rollout scaffold, Qwen2/Qwen3 weight bridges, and README

Commit
28 days ago
Lands the second-stage rollout path, weight-sync infrastructure, and the
example app's README. Includes:

* VLLMRollout that constructs vllm.LLM on training rank 0 and broadcasts
  generated token ids to peer ranks, with disjoint-GPU (subprocess) and
  shared (in-process) topology paths. Weight sync gathers ZeRO-3 params
  cooperatively then pushes to vLLM via LLM.collective_rpc("load_weights").
* WeightBridge ABC with COLUMN / ROW / VOCAB / REPLICATED parallel kinds
  and an even-slice per-rank slicer; Qwen2WeightBridge with the full
  per-parameter table for Qwen2 / Qwen2.5; Qwen3WeightBridge adding the
  per-head q_norm / k_norm tensors as REPLICATED.
* vLLM-side prompt+response stitching factored into stitch_rollout() so
  its index math is unit-testable without a live vLLM.
* CPU-only tests: tests/test_weight_bridge.py covers parallel-kind
  dispatch, per-rank shape/gather round-trips across tp_size in {1,2,4},
  indivisibility / invalid-rank guards, and the registry;
  tests/test_vllm_stitch.py covers prompt/response stitching for the
  common shapes including variable response lengths and left-padded
  prompts.
* configs + launch scripts for both production and smoke vLLM runs.

**Known blocker called out in README and module docstring:** vLLM's
worker init calls new_group() on the global process group, which
deadlocks when launched under the standard `deepspeed --num_gpus N`
launcher (rank 0 calls vLLM, other ranks never participate in vLLM's
collective). The documented fix is the TRL/OpenRLHF separate-server
pattern; this PR lands the scaffolding so that work can begin against a
green codebase.

Signed-off-by: Zhipeng Wang <zhipengbayern@gmail.com>
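The even-slice per-rank slicer described above can be illustrated with a small CPU-only sketch. This is not the WeightBridge code from this commit: the names `ParallelKind`, `slice_for_rank`, and `gather` are hypothetical, weights are plain nested lists rather than tensors, and the sharded dimension for each kind is an assumption following the usual `[out_features, in_features]` weight layout (COLUMN and VOCAB split dim 0, ROW splits dim 1, REPLICATED is copied whole).

```python
from enum import Enum


class ParallelKind(Enum):
    # Hypothetical enum; the sharded dimension per kind is an assumption.
    COLUMN = "column"          # shard output features (dim 0)
    ROW = "row"                # shard input features (dim 1)
    VOCAB = "vocab"            # shard vocabulary rows (dim 0)
    REPLICATED = "replicated"  # every rank holds the full tensor


def _shard_dim(kind):
    """Return the dimension to split, or None for replicated weights."""
    if kind is ParallelKind.REPLICATED:
        return None
    return 1 if kind is ParallelKind.ROW else 0


def slice_for_rank(weight, kind, tp_rank, tp_size):
    """Return the even slice of a 2-D weight (list of rows) owned by tp_rank."""
    if tp_size < 1 or not 0 <= tp_rank < tp_size:
        raise ValueError(f"invalid rank {tp_rank} for tp_size {tp_size}")
    dim = _shard_dim(kind)
    if dim is None:
        return [list(row) for row in weight]
    size = len(weight) if dim == 0 else len(weight[0])
    if size % tp_size != 0:
        raise ValueError(f"dim {dim} of size {size} not divisible by {tp_size}")
    step = size // tp_size
    lo, hi = tp_rank * step, (tp_rank + 1) * step
    if dim == 0:
        return [list(row) for row in weight[lo:hi]]
    return [list(row[lo:hi]) for row in weight]


def gather(shards, kind):
    """Inverse of slice_for_rank: reassemble per-rank shards in rank order."""
    dim = _shard_dim(kind)
    if dim is None:
        return [list(row) for row in shards[0]]
    if dim == 0:
        return [list(row) for shard in shards for row in shard]
    n_rows = len(shards[0])
    return [sum((shard[i] for shard in shards), []) for i in range(n_rows)]
```

Slicing a weight for every rank and gathering the shards back should reproduce the original for each kind, which is the round-trip property the commit message says the tests check across tp_size in {1,2,4}.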
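The prompt+response stitching can be sketched in the same CPU-only style. The signature and return values of the real stitch_rollout() are not given in this message, so the function below (`stitch_rollout_sketch`) is a hypothetical stand-in. It assumes prompts arrive left-padded to a common length, responses have variable length and are right-padded to the longest one, and `pad_id` never occurs as a real prompt token.

```python
def stitch_rollout_sketch(prompt_ids, responses, pad_id):
    """Concatenate left-padded prompts with variable-length responses.

    Hypothetical helper, not the stitch_rollout() from this commit.
    Returns (input_ids, attention_mask, response_mask) as lists of lists,
    each row of length prompt_len + max_response_len.
    """
    if len(prompt_ids) != len(responses):
        raise ValueError("prompt and response batch sizes differ")
    if not prompt_ids:
        return [], [], []
    prompt_len = len(prompt_ids[0])
    if any(len(p) != prompt_len for p in prompt_ids):
        raise ValueError("prompts must be padded to a common length")

    max_resp = max(len(r) for r in responses)
    input_ids, attention_mask, response_mask = [], [], []
    for prompt, resp in zip(prompt_ids, responses):
        pad = max_resp - len(resp)
        # Response tokens start at index prompt_len for every row.
        input_ids.append(list(prompt) + list(resp) + [pad_id] * pad)
        # Assumption: pad_id marks left padding and is not a real prompt token.
        attention_mask.append(
            [int(tok != pad_id) for tok in prompt] + [1] * len(resp) + [0] * pad
        )
        # Marks only generated tokens, e.g. for masking a loss to the response.
        response_mask.append([0] * prompt_len + [1] * len(resp) + [0] * pad)
    return input_ids, attention_mask, response_mask
```

Because the prompts are left-padded, every response begins at the same column, so the index math reduces to a fixed offset plus right padding and can be checked without a live vLLM.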