Fix AutoEP + Muon compatibility for batched expert tensors
1. gram_newtonschulz: replace torch.addmm, which accepts only 2D
inputs, with the equivalent a*Q + Z@Q so that batched 3D expert
weight tensors of shape [num_local_experts, n, m] are supported.
Also pass dim1/dim2 explicitly to diagonal() so that it handles
3D tensors.
2. deepseek_v3 preset: remove e_score_correction_bias from
unsupported_router_bias_names, because auto_ep_layer.py already
copies it correctly (lines 398-402).
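
For reference, the rewrite in item 1 can be checked with a small
sketch like the one below. It is not the gram_newtonschulz code
itself: the helper name addmm_batched, the tensor sizes, the scalar
a = 0.5, and the choice dim1=-2, dim2=-1 are all assumptions made
for illustration, and square per-expert matrices are used only to
keep Z @ Q well defined.

```python
import torch


def addmm_batched(Q, Z, a):
    """Hypothetical helper: a*Q + Z@Q, which broadcasts over leading
    batch dims, unlike torch.addmm(Q, Z, Q, beta=a)."""
    return a * Q + Z @ Q


torch.manual_seed(0)
E, n = 4, 8  # assumed num_local_experts and per-expert matrix size
Q = torch.randn(E, n, n)
Z = torch.randn(E, n, n)
a = 0.5  # assumed scalar coefficient

out = addmm_batched(Q, Z, a)

# Reference: apply the 2D torch.addmm to each expert slice separately.
ref = torch.stack(
    [torch.addmm(Q[i], Z[i], Q[i], beta=a) for i in range(E)]
)
assert torch.allclose(out, ref, atol=1e-5)

# diagonal() defaults to dim1=0, dim2=1, which on a 3D tensor mixes
# the expert dim with a matrix dim. Naming the last two dims gives
# one diagonal per expert.
per_expert_diag = Q.diagonal(dim1=-2, dim2=-1)
assert per_expert_diag.shape == (E, n)
```

The per-expert loop is used here only as a reference for the
comparison; the batched expression needs no loop.
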
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>