DeepSpeed
Fix bf16 gradient norm divergence with ZeRO stage 0
#7839
Merged


tohtana Fix ZeRO-0 + bf16 broken training: disable loss scaling and fix zero_…
773c607b
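The commit above addresses bf16 training with ZeRO stage 0. A minimal config sketch that would hit this code path is shown below; the keys are standard DeepSpeed config fields (`bf16.enabled`, `zero_optimization.stage`, `gradient_clipping`), but the specific values are illustrative assumptions, not the PR's test setup.

```python
# Sketch of a DeepSpeed config for the affected path: bf16 + ZeRO stage 0.
# Values here are hypothetical; only the key names are standard DeepSpeed fields.
ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},          # bf16 training: no fp16-style loss scaling
    "zero_optimization": {"stage": 0},  # ZeRO disabled (stage 0)
    "gradient_clipping": 1.0,           # exercises the gradient-norm computation
}
```

Passing a dict like this to `deepspeed.initialize` selects the bf16 engine with ZeRO stage 0, the combination whose gradient norm diverged before this fix.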
tohtana add test
c5f60af0
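Why disabling loss scaling is the right call for bf16: bf16 keeps fp32's 8 exponent bits (trading mantissa precision), so gradients do not underflow the way they do in fp16, and loss scaling serves no purpose. A self-contained sketch of the dynamic-range difference, using a hypothetical helper `max_finite` for IEEE-754-style formats:

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary format (helper for illustration)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias          # all-ones exponent encodes inf/nan
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exp

fp16_max = max_finite(5, 10)   # fp16: 5 exponent bits -> 65504.0
bf16_max = max_finite(8, 7)    # bf16: 8 exponent bits -> same order as fp32 max
```

Because `bf16_max` is on the order of 3.4e38 while `fp16_max` is only 65504, fp16 needs loss scaling to avoid gradient underflow/overflow, but bf16 does not; applying a loss scale under bf16 can instead distort the reported gradient norm, which is the divergence this PR fixes.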
tohtana requested a review from tjruwase 58 days ago
tohtana requested a review from loadams 58 days ago
sfc-gh-truwase commented on 2026-02-11
tohtana Address PR feedback for issue #7837 loss-scale refactor
459860ce
sfc-gh-truwase approved these changes on 2026-02-12
tohtana merged 1752c2ab into master 55 days ago
