Commit
10 days ago
Add AutoEP (#7938) This PR adds AutoEP (Automatic Expert Parallelism) to DeepSpeed training for HuggingFace MoE models. AutoEP detects MoE blocks during `deepspeed.initialize()`, builds the required EP/EDP process groups, and replaces supported MoE blocks with an EP-enabled execution path, so expert parallelism can be enabled with DeepSpeed config only and without model code changes. Current scope in this PR is the base AutoEP feature: - ZeRO stages 0, 1, and 2 support - checkpoint save/load support - universal checkpoint conversion support ZeRO-3 extensions are intentionally left as follow-up work (#7928 should be merged for this work) Supported presets in this PR: - Mixtral - Qwen3-MoE - DeepSeek-V2 - DeepSeek-V3 For end-to-end benchmarking and testing, an AutoEP example is available in DeepSpeedExamples: - <https://github.com/tohtana/DeepSpeedExamples/tree/tohtana/add_auto_ep/training/expert_parallel> ## Attribution This implementation substantially builds on TorchTitan's MoE / expert-parallel implementation, and we want to explicitly acknowledge that prior work. The TorchTitan-derived pieces in this PR are primarily: - `deepspeed/moe/ep_router.py`: adapted from TorchTitan's `TokenChoiceTopKRouter` - `deepspeed/moe/ep_experts.py`: adapted from TorchTitan's `GroupedExperts` and grouped-GEMM expert execution path - `deepspeed/moe/ep_kernels.py`: adapted from TorchTitan's `TokenReorderer`, `generate_permute_indices`, Triton fill-indices kernel, and token-group alignment / padding helpers - `deepspeed/module_inject/auto_ep_layer.py`: adapts the same router -> reorder -> dispatch -> local expert compute -> combine structure used in TorchTitan's MoE / EP flow Relevant TorchTitan sources: - <https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/common/moe/moe.py> - <https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/common/moe/kernels.py> - <https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/common/moe/utils.py> - <https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/expert_parallel.py> The DeepSpeed-specific work in this PR is the AutoEP integration layer around those building blocks: - HuggingFace MoE detection and structural validation - model-family presets and custom-config path - weight repacking from HF expert layouts into grouped expert tensors - DeepSpeed runtime group setup and module replacement - DeepSpeed checkpoint save/load and universal checkpoint support - DeepSpeed docs and tests ## Design The implementation is split into a few layers: - `deepspeed/module_inject/auto_ep_config.py` - user config parsing - built-in model presets - validation for EP topology and per-model constraints - `deepspeed/module_inject/auto_ep.py` - scans the model for MoE blocks - validates the detected structure - builds a `MoELayerSpec` for each supported MoE layer - replaces the original HF block with `AutoEPMoELayer` - `deepspeed/module_inject/auto_ep_layer.py` - the drop-in execution wrapper for a detected MoE block - implements router execution, token reorder, EP dispatch/combine, local expert compute, and shared-expert merge - `deepspeed/moe/ep_router.py`, `deepspeed/moe/ep_experts.py`, `deepspeed/moe/ep_kernels.py` - reusable MoE runtime pieces for routing, grouped expert compute, token permutation, and aligned grouped-GEMM execution - `deepspeed/moe/ep_repack.py` - converts HF expert weights into the grouped expert layout expected by the runtime - `deepspeed/runtime/engine.py` and checkpoint conversion code - wires AutoEP into `deepspeed.initialize()` - handles checkpoint save/load metadata and universal checkpoint integration At runtime, the execution path is: 1. detect and replace supported HF MoE blocks during initialization 2. route tokens with the EP router 3. reorder tokens by expert assignment 4. perform all-to-all dispatch across the EP group when `autoep_size > 1` 5. run local grouped expert compute 6. all-to-all combine and restore the original token order 7. merge shared experts if the model has them ## Adding new model support There are two supported ways to extend AutoEP to a new MoE model family. 1. Add a preset in `PRESET_MODELS`. This is the preferred path for a model family we want to support out of the box. A preset defines: - MoE layer pattern - router child name - experts child name - expert weight names / layout - `num_experts` and `top_k` config attributes - routing defaults - optional shared-expert structure 2. Use the custom config path. For models that are not yet built into DeepSpeed, AutoEP can be driven from config with: - `moe_layer_pattern` - `router_pattern` - `expert_pattern` - `expert_w1`, `expert_w2`, `expert_w3` - `num_experts_attr` - `top_k_attr` - optional shared-expert fields Once detection can produce a valid `MoELayerSpec`, the replacement, execution, and checkpoint paths are shared. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Signed-off-by: Ma, Guokai <guokai.ma@gmail.com> Signed-off-by: Guokai Ma <guokai.ma@intel.com> Co-authored-by: Ma, Guokai <guokai.ma@gmail.com> Co-authored-by: Guokai Ma <guokai.ma@intel.com>
Author
Parents
Loading