fix(autoep+muon): support Moonlight noaux routing and grouped expert NS
Three changes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with
AutoEP + Muon + ZeRO-2:
1. e_score_correction_bias: copy the pretrained noaux_tc score-correction
bias from the source gate into AutoEP routers and apply it in the
TokenChoiceTopKRouter forward pass so expert selection matches the
pretrained checkpoint.
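A minimal numpy sketch of the noaux-style routing described above (function name and shapes are illustrative, not the actual TokenChoiceTopKRouter code): the correction bias shifts which experts win top-k selection, while the combine weights still come from the unbiased scores.

```python
import numpy as np

def noaux_topk(logits, bias, k):
    """Biased top-k routing sketch: bias affects selection only."""
    # unbiased affinity scores, used for the combine weights
    scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    # score-correction bias shifts expert *selection* only
    topk = np.argsort(scores + bias, axis=-1)[..., ::-1][..., :k]
    weights = np.take_along_axis(scores, topk, axis=-1)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return topk, weights
```

Copying the pretrained bias into the router then makes this selection reproduce the source gate's expert choices.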
2. is_expert_group: mark GroupedExperts w1/w2/w3 tensors with
is_expert_group=True so Muon applies Newton-Schulz independently per
expert slice rather than treating the stacked (E, I, O) tensor as a
single matrix. muon_update gains an is_expert_group kwarg; all four
call sites inside original_muon.py and the ZeRO-2 path in
stage_1_and_2.py pass getattr(p, 'is_expert_group', False).
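A sketch of the per-expert dispatch, in numpy for brevity (the coefficients are the standard Muon quintic Newton-Schulz constants; the function names here are illustrative, not the exact original_muon.py signatures). With is_expert_group=True, orthogonalization runs on each (I, O) expert slice instead of the full stacked tensor.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration approximating the orthogonal factor of G."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_update(grad, is_expert_group=False):
    # sketch: apply Newton-Schulz per expert slice for stacked
    # (E, I, O) GroupedExperts weights, else to the whole matrix
    if is_expert_group:
        return np.stack([newton_schulz(g) for g in grad])
    return newton_schulz(grad)
```

Call sites would pass `getattr(p, 'is_expert_group', False)` so plain 2-D params keep the existing single-matrix path.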
3. Muon + MoE param groups in engine.py: flatten dict-style param groups
produced by configure_moe_param_groups before filtering by use_muon;
re-tag optimizer flags after AutoEP layer replacement; add name keys
for MoE group splitting; call split_params_into_different_moe_groups
when the model has MoE layers.
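The flatten-then-filter step can be sketched as follows (hypothetical shapes: it assumes configure_moe_param_groups yields dicts whose 'params' entry is a list of tagged parameters, which is not necessarily the exact engine.py structure):

```python
class FakeParam:
    """Stand-in for a parameter carrying optimizer tags."""
    def __init__(self, name, use_muon):
        self.name, self.use_muon = name, use_muon

def build_muon_groups(param_groups):
    # flatten dict-style groups into a single parameter list,
    # then split by the use_muon flag re-tagged after layer replacement
    flat = []
    for g in param_groups:
        flat.extend(g['params'] if isinstance(g, dict) else [g])
    muon = [p for p in flat if getattr(p, 'use_muon', False)]
    other = [p for p in flat if not getattr(p, 'use_muon', False)]
    return [{'params': muon, 'use_muon': True},
            {'params': other, 'use_muon': False}]
```

The resulting groups can then be handed to split_params_into_different_moe_groups, which expects name-keyed groups of raw parameters rather than nested dicts.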
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>