Fix bf16 gradient norm divergence with ZeRO stage 0 (#7839)
Fixes: #7837
ZeRO-0 + bf16 has two bugs in `engine.py`:
1. `FP16_UnfusedOptimizer` is constructed with `dynamic_loss_scale` and
`cur_scale=65536`, but `engine.backward()` never scales the loss for bf16,
so `step()` divides the already-unscaled gradients by 65536.
2. `_take_model_step` skips `zero_grad` for bf16 without ZeRO, so
gradients accumulate across steps instead of being cleared.
Fix: disable loss scaling for bf16 and remove the `zero_optimization()`
gate on `zero_grad`.
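A minimal sketch of the two failure modes, using plain-Python stand-ins rather than DeepSpeed's actual classes (`CUR_SCALE` mirrors the default initial dynamic loss scale of 2**16):

```python
CUR_SCALE = 65536.0  # default initial dynamic loss scale (2**16)

def backward(loss_grad):
    # Bug 1: for bf16, engine.backward() backprops the *unscaled* loss,
    # so the gradient it produces carries no loss-scale factor.
    return loss_grad

def step(grad, scale=CUR_SCALE):
    # ...but the optimizer still unscales in step(), dividing the
    # already-unscaled gradient by cur_scale.
    return grad / scale

true_grad = 1.0
applied = step(backward(true_grad))
print(applied)  # 65536x too small -> gradient norm diverges from the fp32 run

# Bug 2: zero_grad was gated on zero_optimization(), so with ZeRO-0 + bf16
# gradient buffers were never cleared between steps.
grad_buffer = 0.0
for _ in range(3):
    grad_buffer += true_grad  # no zero_grad between iterations
print(grad_buffer)  # 3.0 accumulated instead of 1.0 per step
```

With loss scaling disabled for bf16, `step()` applies the gradient as-is; with the `zero_optimization()` gate removed, `zero_grad` runs on every step.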
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>