DeepSpeed
63eeb114 - fix: Validate fp16.loss_scale is finite and non-negative (#7889)

Commit
18 days ago
fix: Validate fp16.loss_scale is finite and non-negative (#7889)

Add a Pydantic field validator to DeepSpeedFP16Config to reject NaN/inf/-inf and negative values for fp16.loss_scale (while keeping 0 as dynamic loss scaling). This prevents invalid configs from silently initializing and causing NaNs during training.

Test: Run pytest -q tests/unit/runtime/test_precision_config_loss_scale.py

Result:
```
root@72170d0458e9:/home/DeepSpeed_woo# pytest -q tests/unit/runtime/test_precision_config_loss_scale.py
=================================================================== test session starts ===================================================================
platform linux -- Python 3.11.10, pytest-8.3.5, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
Using --randomly-seed=1526199052
rootdir: /home/DeepSpeed_woo/tests
configfile: pytest.ini
plugins: xdist-3.8.0, randomly-4.0.1, forked-1.6.0, anyio-4.6.0
collected 10 items

tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[3] PASSED [ 10%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[0] PASSED [ 20%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[inf] PASSED [ 30%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[1] PASSED [ 40%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[nan] PASSED [ 50%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[2.0] PASSED [ 60%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[True] PASSED [ 70%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale0] PASSED [ 80%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[-1] PASSED [ 90%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale1] PASSED [100%]

(30 durations < 1s hidden. Use -vv to show these durations.)
============================================================= 10 passed, 16 warnings in 4.18s =============================================================
```

Fix issue #7852

---------

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
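
The check described in the commit message can be sketched as a plain function. This is a minimal illustration only, not the code from the commit: the actual change is a Pydantic field validator on DeepSpeedFP16Config, whereas `validate_loss_scale` below is a hypothetical stdlib-only helper, and its error messages are invented. Rejecting `True` follows the test IDs in the log above; rejecting other non-numeric types such as strings is an assumption.

```python
import math


def validate_loss_scale(value):
    """Hypothetical helper: return the loss scale as a float, or raise ValueError."""
    # bool is a subclass of int in Python, so it must be rejected explicitly.
    # Rejecting other non-numeric types here is an assumption of this sketch.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(
            f"fp16.loss_scale must be a number, got {type(value).__name__}"
        )
    value = float(value)
    # NaN, inf and -inf all fail the finiteness check.
    if not math.isfinite(value):
        raise ValueError("fp16.loss_scale must be finite")
    if value < 0:
        raise ValueError("fp16.loss_scale must be non-negative")
    # 0 stays valid: per the commit message it selects dynamic loss scaling.
    return value
```

With this sketch, `validate_loss_scale(0)` and `validate_loss_scale(2.0)` return the value unchanged, while `float("nan")`, `float("inf")`, `-1`, and `True` each raise `ValueError`.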