vllm
a97bd607 - [Core] Add Prefill Context Parallelism (PCP) and All-to-All DCP communication

Commit
46 days ago
[Core] Add Prefill Context Parallelism (PCP) and All-to-All DCP communication

This PR adds Prefill Context Parallelism (PCP) support for splitting prefill tokens across ranks using a DualChunkSwap pattern, and integrates an All-to-All communication backend for Decode Context Parallelism (DCP).

Key changes:
- Add PCP with DualChunkSwap token partitioning for balanced prefill computation
- Add All-to-All DCP communication backend reducing NCCL calls from 3 to 2
- Restrict DCP+PCP to two clean configurations:
  - Case 1: DCP = PCP (same TP position, all-reduce only)
  - Case 2: DCP = TP × PCP (full TP all-gather, all-reduce + slice)
- Add PCPManager for buffer management and input partitioning
- Update attention backends (FlashAttention, FlashInfer, MLA) for PCP support
- Add comprehensive tests for DCP operations

Co-Authored-By: QiuChunshuo <qiuchunshuo@huawei.com>
Co-Authored-By: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-Authored-By: FENP <yuanyongjie.yyj@antgroup.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
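
The commit message does not spell out how DualChunkSwap partitions tokens. The sketch below shows one plausible reading, assuming it follows the common head-tail pairing used for causal-attention load balancing: the sequence is cut into 2 × PCP equal chunks, and rank r takes chunk r together with chunk 2 × PCP - 1 - r. The function names `dual_chunk_swap_indices` and `causal_work` are hypothetical and are not taken from vLLM or from `PCPManager`.

```python
def dual_chunk_swap_indices(num_tokens: int, pcp_size: int, rank: int) -> list[int]:
    """Token positions owned by `rank` under head-tail chunk pairing.

    Hypothetical helper for illustration; assumes num_tokens divides
    evenly into 2 * pcp_size chunks (a real implementation would pad).
    """
    if not 0 <= rank < pcp_size:
        raise ValueError("rank must be in [0, pcp_size)")
    if num_tokens % (2 * pcp_size) != 0:
        raise ValueError("num_tokens must be a multiple of 2 * pcp_size")

    chunk = num_tokens // (2 * pcp_size)
    head_id = rank                     # early chunk: cheap under a causal mask
    tail_id = 2 * pcp_size - 1 - rank  # mirrored late chunk: expensive
    head = range(head_id * chunk, (head_id + 1) * chunk)
    tail = range(tail_id * chunk, (tail_id + 1) * chunk)
    return list(head) + list(tail)


def causal_work(indices: list[int]) -> int:
    """Rough cost model: a token at position i attends to i + 1 keys."""
    return sum(i + 1 for i in indices)


if __name__ == "__main__":
    for r in range(2):
        idx = dual_chunk_swap_indices(num_tokens=8, pcp_size=2, rank=r)
        print(r, idx, causal_work(idx))
```

Pairing an early chunk with its mirrored late chunk gives every rank the same causal-attention cost under this simple model, which matches the "balanced prefill computation" goal stated in the key changes.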