Fix bf16 gradient norm divergence with ZeRO stage 0 (#7839)
Fixes: #7837
ZeRO-0 + bf16 has two bugs in `engine.py`:
1. `FP16_UnfusedOptimizer` is constructed with `dynamic_loss_scale` and
`cur_scale=65536`, but `engine.backward()` never scales the loss for bf16,
so `step()` divides the already-unscaled gradients by 65536.
2. `_take_model_step` skips `zero_grad` for bf16 without ZeRO, so
gradients accumulate across steps instead of being cleared.
Fix: disable loss scaling for bf16 and remove the `zero_optimization()`
gate on `zero_grad`.
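A minimal sketch of the two failure modes, using plain-Python stand-ins rather than DeepSpeed's actual classes (`CUR_SCALE` mirrors the default initial dynamic loss scale of 2**16):

```python
CUR_SCALE = 65536.0  # default initial dynamic loss scale (2**16)

def backward(loss_grad):
    # Bug 1: for bf16, engine.backward() backprops the *unscaled* loss,
    # so the gradient it produces carries no loss-scale factor.
    return loss_grad

def step(grad, scale=CUR_SCALE):
    # ...but the optimizer still unscales in step(), dividing the
    # already-unscaled gradient by cur_scale.
    return grad / scale

true_grad = 1.0
applied = step(backward(true_grad))
print(applied)  # 65536x too small -> gradient norm diverges from the fp32 run

# Bug 2: zero_grad was gated on zero_optimization(), so with ZeRO-0 + bf16
# gradient buffers were never cleared between steps.
grad_buffer = 0.0
for _ in range(3):
    grad_buffer += true_grad  # no zero_grad between iterations
print(grad_buffer)  # 3.0 accumulated instead of 1.0 per step
```

With loss scaling disabled for bf16, `step()` applies the gradient as-is; with the `zero_optimization()` gate removed, `zero_grad` runs on every step.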
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>