DeepSpeed
b647fb24 - Fix expert grad scaling problem with ZeRO optimizer (#6546)

Fixes #6545.

Work:
- expert gradient average: divide by edp_world_size -> divide by dp_world_size
- unit test: make sure a model run with different dp/ep settings produces the same expert gradients

Co-authored-by: wangyiou <wangyiou@xiaohongshu.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
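The change described above is the denominator used when averaging expert gradients. Below is a minimal sketch of that arithmetic only, not DeepSpeed's actual code; the names (dp_world_size, ep_world_size, edp_world_size, grad_sum) and the example sizes are illustrative assumptions.

```python
# Sketch of the gradient-averaging arithmetic described in the commit message.
# All names and sizes are illustrative, not DeepSpeed internals.

dp_world_size = 4   # data-parallel replicas of the dense (non-expert) parameters
ep_world_size = 2   # expert-parallel degree: experts are sharded over 2 ranks
edp_world_size = dp_world_size // ep_world_size  # replicas of each expert = 2

# Suppose each replica of an expert parameter contributes a gradient of 1.0.
# The reduction over the expert data-parallel group sums edp_world_size copies.
grad_sum = 1.0 * edp_world_size

# Before the fix: average over the (smaller) expert data-parallel group.
grad_before = grad_sum / edp_world_size   # -> 1.0

# After the fix: average over the full data-parallel world size, so the result
# no longer depends on the dp/ep split (the property the new unit test checks).
grad_after = grad_sum / dp_world_size     # -> 0.5

print(grad_before, grad_after)
```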