DeepSpeed
55491413 - Optimize singleton MoE collectives (#7997)

Commit
12 days ago
Optimize singleton MoE collectives (#7997)

## Summary

This PR avoids unnecessary MoE collectives when the expert-parallel group has a single rank.

The change is narrow:

- skip the two `MOELayer` all-to-all calls when `ep_size == 1`
- skip top-1/top-2/top-k capacity `all_reduce(MAX)` when the explicit expert-parallel group has world size 1
- keep the existing collective paths unchanged for non-singleton expert-parallel groups

## Motivation

For a singleton expert-parallel group, these collectives are identity operations:

- `all_to_all_single(..., group=ep_group)` has no remote rank to exchange with
- `all_reduce(..., op=MAX, group=ep_group)` leaves the tensor unchanged

In a downstream `ep_size=1` MoE run, profiling showed repeated singleton all-to-all and capacity all-reduce calls dominating late-step time. A local version of this guarded optimization reduced late-step timing from around 13s/update to around 0.864s/update while keeping loss and MoE auxiliary loss finite.

This is related to #7141, which also reports `ep_size=1` MoE all-to-all behavior.

## Correctness

The fastpaths are guarded by the existing MoE runtime structure:

- `MOELayer` skips `_AllToAll.apply(...)` only when `self.ep_size == 1`
- the singleton all-to-all path still calls `.contiguous()`, preserving the layout normalization previously performed inside `_AllToAll.forward`
- gate capacity reduction checks the runtime world size of the explicit `ep_group`
- `ep_group=None` is not treated as a singleton expert group
- non-singleton expert-parallel groups still use the original collectives

This does not change routing, capacity math, expert execution, combine logic, auxiliary loss, or expert counts.
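The identity argument and the guards described above can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not the code in `deepspeed/moe/sharded_moe.py`: the helper names (`all_to_all`, `all_reduce_max`, `exchange_tokens`, `needs_capacity_all_reduce`) are hypothetical, ranks are modeled as list entries rather than processes, and a list copy stands in for `.contiguous()`.

```python
from typing import List, Optional


def all_to_all(rank_inputs: List[List[int]]) -> List[List[int]]:
    """Toy model of an equal-split all-to-all.

    rank_inputs[r] is rank r's send buffer, split into world_size equal
    chunks. Rank d receives chunk d from every rank, in source-rank order.
    """
    world_size = len(rank_inputs)
    outputs = []
    for dst in range(world_size):
        received: List[int] = []
        for src in range(world_size):
            data = rank_inputs[src]
            chunk = len(data) // world_size
            received.extend(data[dst * chunk:(dst + 1) * chunk])
        outputs.append(received)
    return outputs


def all_reduce_max(rank_values: List[int]) -> List[int]:
    """Toy model of all_reduce(MAX): every rank ends up with the group max."""
    group_max = max(rank_values)
    return [group_max] * len(rank_values)


def exchange_tokens(rank_inputs: List[List[int]]) -> List[List[int]]:
    """Hypothetical guarded dispatch: skip the exchange for a singleton group.

    The copy stands in for the layout normalization (.contiguous()) that the
    commit message says is preserved on the fast path.
    """
    if len(rank_inputs) == 1:
        return [list(rank_inputs[0])]
    return all_to_all(rank_inputs)


def needs_capacity_all_reduce(ep_group_size: Optional[int]) -> bool:
    """Hypothetical guard for the capacity reduction.

    None models "no explicit ep_group". Following the commit message, that
    case is NOT treated as a singleton, so the original path is kept.
    """
    return ep_group_size is None or ep_group_size > 1


if __name__ == "__main__":
    # Singleton group: both collectives return their input unchanged.
    print(all_to_all([[1, 2, 3]]))
    print(all_reduce_max([7]))
    # Two ranks: data actually moves, so the collective must stay.
    print(all_to_all([[1, 2], [3, 4]]))
```

In this model the fast path returns the same values as the full collective whenever the group has one rank, which is the property the guards rely on. Only an explicit group of size 1 takes the shortcut; a missing group falls through to the original path.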
## Testing

- `pre-commit run --files deepspeed/moe/sharded_moe.py tests/unit/moe/test_moe.py`
- `git diff --check`
- `pytest --forked tests/unit/moe/test_moe.py -v -k "singleton"`

The targeted pytest command selected 9 singleton tests locally, but they skipped because this local environment has no accelerator, matching the existing `DistributedTest` behavior.

Downstream smoke evidence:

- 2-rank H200 run
- top-2 MoE, `drop_tokens=False`
- reached update 476 after the local fix
- finite loss and MoE auxiliary loss
- late-step timing improved from around 13s/update to around 0.864s/update

Signed-off-by: Tianyi Wang <npufranklin@gmail.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>