DeepSpeed
dfc78347 - fix: BF16_Optimizer selection and compatibility issues

Several bugs were causing test_bf16_optimizer_fragments to fail:

1. DDP_BFLOAT16 constant collision with BFLOAT16 (see sketch 1 below)
   - Both were set to "bf16", so BF16_Optimizer was never selected
   - Changed DDP_BFLOAT16 to "ddp_bf16" to differentiate the two
2. Missing attributes in BF16_Optimizer (sketch 2 below)
   - Added custom_loss_scaler, external_loss_scale, torch_autocast_gradscaler
   - These are required by base_optimizer.py's needs_scaler() and scale_if_loss()
3. scale_if_loss() assumed loss_scaler always exists (sketch 2 below)
   - Added a hasattr check before calling loss_scaler.scale_loss()
4. Test config missing grad_accum_dtype (sketch 3 below)
   - Added data_types.grad_accum_dtype=fp32 to ensure BF16_Optimizer is used
   - Without it, FP16_Optimizer is used, which doesn't support the tensor fragment APIs
5. Added DS_DISABLE_REUSE_DIST_ENV support in tests/unit/common.py (sketch 4 below)
   - Allows disabling reuse_dist_env via an environment variable for CI

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
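Sketch 1: a minimal, self-contained illustration of the selection bug in item 1. The dispatch function below is hypothetical (DeepSpeed's actual engine logic is more involved); only the constant names and their values come from the commit message.

```python
BFLOAT16 = "bf16"
DDP_BFLOAT16 = "bf16"  # before the fix: identical to BFLOAT16


def select_optimizer(dtype: str) -> str:
    # Hypothetical dispatch: because both constants hold "bf16", the
    # first comparison always wins and the BF16_Optimizer branch is
    # unreachable.
    if dtype == DDP_BFLOAT16:
        return "ddp path"
    if dtype == BFLOAT16:
        return "BF16_Optimizer"
    return "basic optimizer"


assert select_optimizer("bf16") == "ddp path"  # wrong path selected

DDP_BFLOAT16 = "ddp_bf16"  # the fix: differentiate the constants
assert select_optimizer("bf16") == "BF16_Optimizer"  # now correct
```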
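Sketch 2: items 2 and 3 in one simplified class. The attribute names, the hasattr guard, and scale_loss() are taken from the commit message; the default values and the unscaled fall-through are assumptions, and this is not DeepSpeed's actual implementation.

```python
import torch


class BF16_Optimizer:
    # Simplified stand-in for DeepSpeed's BF16_Optimizer (sketch only).

    def __init__(self):
        # Item 2: attributes that base_optimizer.py's needs_scaler() and
        # scale_if_loss() read; before the fix they were missing here,
        # so accessing them raised AttributeError.
        self.custom_loss_scaler = False
        self.external_loss_scale = None
        self.torch_autocast_gradscaler = None

    def scale_if_loss(self, loss: torch.Tensor) -> torch.Tensor:
        # Item 3: check that loss_scaler exists instead of assuming it.
        if hasattr(self, "loss_scaler"):
            return self.loss_scaler.scale_loss(loss)
        return loss  # assumed behavior: pass the loss through unscaled


opt = BF16_Optimizer()
loss = torch.tensor(1.5)
assert opt.scale_if_loss(loss) is loss  # no scaler set: loss unchanged
```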
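Sketch 3: the test-config change from item 4, shown as a Python dict. Only the data_types.grad_accum_dtype entry is taken from the commit; the surrounding keys are a plausible minimal bf16 config, not the actual test's.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    # Without this key the engine fell back to FP16_Optimizer, which
    # does not support the tensor fragment APIs the test exercises.
    "data_types": {"grad_accum_dtype": "fp32"},
}
```

Passing a dict like this via deepspeed.initialize(..., config=ds_config) is the usual way such settings reach the engine.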
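Sketch 4: item 5's CI switch. Only the variable name DS_DISABLE_REUSE_DIST_ENV and the file tests/unit/common.py come from the commit; the helper name and the accepted value ("1") are assumptions.

```python
import os


def _should_reuse_dist_env(requested: bool) -> bool:
    # Hypothetical helper for tests/unit/common.py: lets CI force a
    # fresh distributed environment per test instead of reusing one.
    if os.environ.get("DS_DISABLE_REUSE_DIST_ENV", "0") == "1":
        return False
    return requested
```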