Optimize singleton MoE collectives (#7997)
## Summary
This PR avoids unnecessary MoE collectives when the expert-parallel
group has a single rank.
The change is narrow:
- skip the two `MOELayer` all-to-all calls when `ep_size == 1`
- skip top-1/top-2/top-k capacity `all_reduce(MAX)` when the explicit
expert-parallel group has world size 1
- keep the existing collective paths unchanged for non-singleton
expert-parallel groups
## Motivation
For a singleton expert-parallel group, these collectives are identity
operations:
- `all_to_all_single(..., group=ep_group)` has no remote rank to
exchange with
- `all_reduce(..., op=MAX, group=ep_group)` leaves the tensor unchanged
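The identity claim above can be checked with a small pure-Python model of the two collectives. This is only an illustrative sketch: `simulate_all_reduce_max` and `simulate_all_to_all` are hypothetical helpers that operate on plain lists, not the `torch.distributed` calls used by DeepSpeed.

```python
from typing import List


def simulate_all_reduce_max(per_rank: List[List[float]]) -> List[List[float]]:
    """Model all_reduce(MAX): every rank ends with the element-wise max over ranks."""
    reduced = [max(column) for column in zip(*per_rank)]
    return [list(reduced) for _ in per_rank]


def simulate_all_to_all(per_rank: List[List[List[int]]]) -> List[List[List[int]]]:
    """Model all-to-all: per_rank[src][dst] is the chunk that src sends to dst.

    The result is indexed as output[dst][src], i.e. what dst received from src.
    """
    world = len(per_rank)
    return [[per_rank[src][dst] for src in range(world)] for dst in range(world)]


# Singleton group: both collectives return their input unchanged.
capacity = [[4.0, 7.0]]
assert simulate_all_reduce_max(capacity) == capacity

tokens = [[[1, 2, 3]]]
assert simulate_all_to_all(tokens) == tokens

# Two ranks, for contrast: the reduction actually changes the per-rank values.
two_rank_capacity = [[4.0, 7.0], [9.0, 2.0]]
assert simulate_all_reduce_max(two_rank_capacity) == [[9.0, 7.0], [9.0, 7.0]]
```

With one rank there is nothing to exchange or compare against, so the only effect of issuing the real collective is its launch and synchronization cost.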
In a downstream `ep_size=1` MoE run, profiling showed repeated singleton
all-to-all and capacity all-reduce calls dominating late-step time. A
local version of this guarded optimization reduced late-step timing from
around 13s/update to around 0.864s/update while keeping loss and MoE
auxiliary loss finite.
This is related to #7141, which also reports `ep_size=1` MoE all-to-all
behavior.
## Correctness
The fast paths are guarded by the existing MoE runtime structure:
- `MOELayer` skips `_AllToAll.apply(...)` only when `self.ep_size == 1`
- the singleton all-to-all path still calls `.contiguous()`, preserving
the layout normalization previously performed inside `_AllToAll.forward`
- gate capacity reduction checks the runtime world size of the explicit
`ep_group`
- `ep_group=None` is not treated as a singleton expert group
- non-singleton expert-parallel groups still use the original
collectives
This does not change routing, capacity math, expert execution, combine
logic, auxiliary loss, or expert counts.
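The capacity-reduction guard described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `sharded_moe.py`: `is_singleton_ep_group`, `reduce_capacity`, `FakeGroup`, and the injected `get_world_size` / `all_reduce_max` callables are hypothetical names, used here so the logic runs without a distributed backend.

```python
from typing import Any, Callable, List, Optional


def is_singleton_ep_group(ep_group: Optional[Any],
                          get_world_size: Callable[[Any], int]) -> bool:
    """Return True only for an explicit group whose runtime world size is 1."""
    if ep_group is None:
        # No explicit group: never assume it is a singleton.
        return False
    return get_world_size(ep_group) == 1


def reduce_capacity(capacity: int,
                    ep_group: Optional[Any],
                    get_world_size: Callable[[Any], int],
                    all_reduce_max: Callable[[int, Optional[Any]], int]) -> int:
    """Skip the MAX reduction for a singleton group; otherwise run it as before."""
    if is_singleton_ep_group(ep_group, get_world_size):
        return capacity
    return all_reduce_max(capacity, ep_group)


class FakeGroup:
    """Hypothetical stand-in for a process group handle."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size


def fake_world_size(group: FakeGroup) -> int:
    return group.world_size


calls: List[Optional[FakeGroup]] = []


def recording_all_reduce_max(capacity: int, group: Optional[FakeGroup]) -> int:
    # Record that the collective path was taken; a real reduction could change the value.
    calls.append(group)
    return capacity


reduce_capacity(8, FakeGroup(1), fake_world_size, recording_all_reduce_max)
assert calls == []          # singleton explicit group: collective skipped

reduce_capacity(8, None, fake_world_size, recording_all_reduce_max)
assert calls == [None]      # ep_group=None: collective still issued

reduce_capacity(8, FakeGroup(4), fake_world_size, recording_all_reduce_max)
assert len(calls) == 2      # non-singleton group: original path
```

Checking `ep_group is None` before querying the world size keeps the default-group case on the original collective path, which matches the conservative behavior listed above.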
## Testing
- `pre-commit run --files deepspeed/moe/sharded_moe.py
tests/unit/moe/test_moe.py`
- `git diff --check`
- `pytest --forked tests/unit/moe/test_moe.py -v -k "singleton"`
The targeted pytest command selected 9 singleton tests locally, but all
of them were skipped because the local environment has no accelerator,
which matches the existing `DistributedTest` behavior.
Downstream smoke evidence:
- 2-rank H200 run
- top-2 MoE, `drop_tokens=False`
- reached update 476 after the local fix
- finite loss and MoE auxiliary loss
- late-step timing improved from around 13s/update to around
0.864s/update
Signed-off-by: Tianyi Wang <npufranklin@gmail.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>