Fix expert grad scaling problem with ZeRO optimizer (#6546)
Fixes #6545.
Changes:
- Expert gradient averaging: divide by dp_world_size instead of
edp_world_size.
- Unit test: verify that models with different dp/ep configurations
produce the same expert gradients.
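
The scaling issue can be illustrated with a toy model. This is a minimal sketch and not the DeepSpeed implementation: `expert_grad`, `per_rank_grads`, and the `divisor` argument are hypothetical names. It assumes each expert replica accumulates gradient from tokens routed to it by `ep_size` ranks, and that the replicas are then sum-reduced across the expert data parallel group.

```python
def expert_grad(per_rank_grads, ep_size, divisor="dp"):
    """Toy model of expert gradient averaging (illustrative only).

    per_rank_grads: gradient contribution originating from each data
                    parallel rank's tokens (one float per rank).
    ep_size:        expert parallel size (assumed to divide dp_world_size).
    divisor:        "dp" divides by dp_world_size, "edp" by edp_world_size.
    """
    dp_world_size = len(per_rank_grads)
    assert dp_world_size % ep_size == 0
    edp_world_size = dp_world_size // ep_size

    # Assumption: each expert replica sees tokens from ep_size ranks,
    # so its local gradient is the sum of those ranks' contributions.
    replica_grads = [
        sum(per_rank_grads[i * ep_size:(i + 1) * ep_size])
        for i in range(edp_world_size)
    ]

    # Sum-reduce across the expert data parallel group.
    reduced = sum(replica_grads)

    denom = dp_world_size if divisor == "dp" else edp_world_size
    return reduced / denom


grads = [1.0, 2.0, 3.0, 4.0]  # made-up per-rank contributions

# Dividing by dp_world_size gives the same result for every ep_size.
by_dp = [expert_grad(grads, ep, "dp") for ep in (1, 2, 4)]

# Dividing by edp_world_size makes the result grow with ep_size.
by_edp = [expert_grad(grads, ep, "edp") for ep in (1, 2, 4)]
```

Under these assumptions, the reduced sum already covers contributions from all dp ranks, so dividing by `dp_world_size` yields the global mean regardless of the dp/ep split, while dividing by `edp_world_size` overscales the gradient by a factor of `ep_size`.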
---------
Co-authored-by: wangyiou <wangyiou@xiaohongshu.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>