DeepSpeed
e2f626a5 - fix(autoep+muon): support Moonlight noaux routing and grouped expert NS

Commit
9 days ago
fix(autoep+muon): support Moonlight noaux routing and grouped expert NS

Three changes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with AutoEP + Muon + ZeRO-2:

1. e_score_correction_bias: copy the pretrained noaux_tc score-correction bias from the source gate into AutoEP routers and apply it in the TokenChoiceTopKRouter forward pass, so expert selection matches the pretrained checkpoint.

2. is_expert_group: mark GroupedExperts w1/w2/w3 tensors with is_expert_group=True so Muon applies Newton-Schulz independently per expert slice rather than treating the stacked (E, I, O) tensor as a single matrix. muon_update gains an is_expert_group kwarg; all four call sites inside original_muon.py and the ZeRO-2 path in stage_1_and_2.py pass getattr(p, 'is_expert_group', False).

3. Muon + MoE param groups in engine.py: flatten dict-style param groups produced by configure_moe_param_groups before filtering by use_muon; re-tag optimizer flags after AutoEP layer replacement; add name keys for MoE group splitting; and call split_params_into_different_moe_groups when the model has MoE layers.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
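The first change hinges on DeepSeek-V3-style noaux_tc routing, where the score-correction bias influences which experts are selected but not the weights used to combine their outputs. A minimal sketch of that forward-pass logic, under stated assumptions (the function name noaux_topk_route is hypothetical, and sigmoid scoring with post-selection renormalization follows the DeepSeek-V3 routing scheme; the actual TokenChoiceTopKRouter code may differ):

```python
import torch

def noaux_topk_route(logits, correction_bias, top_k):
    """Hypothetical sketch of noaux_tc top-k routing with a score-correction bias.

    The bias is added only for expert SELECTION; the combine weights come
    from the uncorrected scores, so copying the pretrained bias into the
    router reproduces the checkpoint's expert choices.
    """
    scores = torch.sigmoid(logits)                    # (tokens, experts)
    biased = scores + correction_bias                 # bias affects selection only
    _, topk_idx = torch.topk(biased, top_k, dim=-1)   # pick experts on biased scores
    topk_weight = scores.gather(-1, topk_idx)         # weight with raw scores
    topk_weight = topk_weight / topk_weight.sum(-1, keepdim=True)
    return topk_idx, topk_weight
```

Without the copied bias, a freshly initialized router would select a different expert set per token than the pretrained gate, which is why this change is needed for checkpoint-faithful inference.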
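The second change, applying Newton-Schulz per expert slice, can be sketched as follows. This is a simplified illustration, not the repository's code: muon_orthogonalize is a hypothetical name, and the quintic Newton-Schulz coefficients are the commonly published Muon values; the real muon_update also handles momentum and learning-rate scaling, which are omitted here.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration approximating orthogonalization of a
    # single 2-D gradient matrix (Muon's whitening step).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float() / (G.norm() + eps)
    transposed = X.size(-2) > X.size(-1)
    if transposed:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)

def muon_orthogonalize(grad, is_expert_group=False):
    # Sketch of the fix: for GroupedExperts weights stacked as (E, I, O),
    # run Newton-Schulz independently on each expert's (I, O) slice rather
    # than treating the stacked tensor as one big matrix.
    if is_expert_group:
        return torch.stack([newton_schulz(g) for g in grad])
    return newton_schulz(grad)
```

Treating the stacked (E, I, O) tensor as a single matrix would orthogonalize across experts, mixing gradient directions that belong to independent weight matrices; the per-slice loop keeps each expert's update self-contained.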
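The third change concerns param-group plumbing: configure_moe_param_groups can emit dict-style groups, which must be flattened into plain parameter lists before they can be partitioned by the use_muon flag. A minimal sketch, assuming hypothetical helper names (flatten_param_groups and split_by_use_muon are illustrative, not DeepSpeed APIs):

```python
def flatten_param_groups(param_groups):
    """Flatten a mix of dict-style groups ({'params': [...]}) and bare
    parameters into one flat parameter list, so a later pass can filter
    by per-parameter attributes such as use_muon."""
    params = []
    for group in param_groups:
        if isinstance(group, dict):
            params.extend(group['params'])
        else:
            params.append(group)
    return params

def split_by_use_muon(params):
    """Partition parameters into Muon-updated and non-Muon lists based on
    the use_muon attribute (re-tagged after AutoEP layer replacement)."""
    muon, non_muon = [], []
    for p in params:
        (muon if getattr(p, 'use_muon', False) else non_muon).append(p)
    return muon, non_muon
```

After this split, the commit additionally routes MoE parameters through split_params_into_different_moe_groups so that expert and non-expert parameters land in separate optimizer groups, which ZeRO-2 requires to partition them correctly.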