[Core] Add Prefill Context Parallelism (PCP) and All-to-All DCP communication

This PR adds Prefill Context Parallelism (PCP) support for splitting prefill
tokens across ranks using a DualChunkSwap pattern, and integrates an All-to-All
communication backend for Decode Context Parallelism (DCP).

Key changes:
- Add PCP with DualChunkSwap token partitioning for balanced prefill computation
- Add All-to-All DCP communication backend, reducing NCCL calls from 3 to 2
- Restrict DCP+PCP to two clean configurations:
  - Case 1: DCP = PCP (same TP position, all-reduce only)
  - Case 2: DCP = TP × PCP (full TP all-gather, all-reduce + slice)
- Add PCPManager for buffer management and input partitioning
- Update attention backends (FlashAttention, FlashInfer, MLA) for PCP support
- Add comprehensive tests for DCP operations
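
For reviewers unfamiliar with the DualChunkSwap partitioning named above, here
is a minimal sketch of the idea. It assumes the common head/tail pairing scheme:
the prompt is cut into 2 * pcp_size equal chunks and rank r takes chunk r plus
chunk (2 * pcp_size - 1 - r). The helper names `dual_chunk_swap_indices` and
`causal_work` are hypothetical and are not taken from this PR; the real logic
lives in PCPManager and may handle padding and batching differently.

```python
def dual_chunk_swap_indices(num_tokens: int, pcp_size: int, rank: int) -> list[int]:
    """Token positions assigned to `rank` (hypothetical helper, not the PR's API).

    Assumption: num_tokens is divisible by 2 * pcp_size. A real implementation
    would pad the sequence instead of raising.
    """
    num_chunks = 2 * pcp_size
    if num_tokens % num_chunks != 0:
        raise ValueError("sketch assumes num_tokens is divisible by 2 * pcp_size")
    chunk = num_tokens // num_chunks

    # "Head" chunk: the rank-th chunk counted from the start of the sequence.
    head = range(rank * chunk, (rank + 1) * chunk)

    # "Tail" chunk: the mirrored chunk counted from the end of the sequence.
    tail_id = num_chunks - 1 - rank
    tail = range(tail_id * chunk, (tail_id + 1) * chunk)

    return [*head, *tail]


def causal_work(indices: list[int]) -> int:
    """Rough cost model: under a causal mask, position p attends to p + 1 keys."""
    return sum(p + 1 for p in indices)
```

With 8 tokens and pcp_size = 2, rank 0 gets positions [0, 1, 6, 7] and rank 1
gets [2, 3, 4, 5]. Pairing a cheap early chunk with an expensive late chunk
gives both ranks the same causal-attention cost under this model, which is the
balancing property the partitioning is meant to provide.
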

Co-Authored-By: QiuChunshuo <qiuchunshuo@huawei.com>
Co-Authored-By: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-Authored-By: FENP <yuanyongjie.yyj@antgroup.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>