DeepSpeed
991ebd72 - fix checkpointing/loading of z0+bf16 (#7786)

Commit
37 days ago
fix checkpointing/loading of z0+bf16 (#7786)

When using `bf16=True` with `zero_optimization.stage=0`, the optimizer state is not saved or loaded during checkpointing. The optimizer's `step` counter and other states (`exp_avg`, `exp_avg_sq`) are lost after loading a checkpoint.

This PR addresses the issue by fixing a flag indicating the config and adds a test arg to cover the problematic case.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
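
For illustration only, the affected combination and the symptom could be sketched as below. This is not code from the PR: the config keys follow the usual DeepSpeed JSON layout (`bf16.enabled`, `zero_optimization.stage`), and both helpers (`is_affected_config`, `missing_optimizer_state`) are hypothetical names that operate on plain dicts shaped like the per-parameter `state` section of a PyTorch optimizer state dict.

```python
# Config sketch for the affected combination: bf16 enabled, ZeRO stage 0.
# The batch-size value is an arbitrary placeholder.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 0},
}


def is_affected_config(config):
    """Hypothetical helper: True when bf16 is on and the ZeRO stage is 0."""
    bf16_on = config.get("bf16", {}).get("enabled", False)
    stage = config.get("zero_optimization", {}).get("stage", 0)
    return bool(bf16_on) and stage == 0


def missing_optimizer_state(saved, loaded,
                            keys=("step", "exp_avg", "exp_avg_sq")):
    """Hypothetical helper: list state keys present before saving
    but absent after loading, grouped by parameter id."""
    missing = {}
    for param_id, state in saved.items():
        restored = loaded.get(param_id, {})
        lost = [k for k in keys if k in state and k not in restored]
        if lost:
            missing[param_id] = lost
    return missing
```

A check along these lines, run on the optimizer state captured before saving and after loading a checkpoint, would return an empty dict once the state round-trips correctly.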