fix: Validate fp16.loss_scale is finite and non-negative (#7889)
Add a Pydantic field validator to DeepSpeedFP16Config that rejects
NaN, inf, -inf, and negative values for fp16.loss_scale, while still
accepting 0, which selects dynamic loss scaling. This prevents invalid
configs from initializing silently and producing NaNs during training.
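For illustration, the kind of check such a validator might perform is sketched below as a plain function, so it runs without Pydantic. The name `check_loss_scale` and the error messages are hypothetical and are not taken from the actual change. The explicit bool rejection is an assumption inferred from the `[True]` test case in the results.

```python
import math


def check_loss_scale(value):
    """Hypothetical helper: return value as a float, or raise ValueError."""
    # bool is a subclass of int, so reject it before the numeric type check.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(
            f"fp16.loss_scale must be a number, got {type(value).__name__}")
    # Rejects NaN, inf, and -inf.
    if not math.isfinite(value):
        raise ValueError(f"fp16.loss_scale must be finite, got {value}")
    # 0 stays valid because it selects dynamic loss scaling.
    if value < 0:
        raise ValueError(f"fp16.loss_scale must be >= 0, got {value}")
    return float(value)
```

In a Pydantic model, a function like this would typically be called from a method decorated with `field_validator("loss_scale")`, so the error surfaces when the config is constructed rather than during training.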
Test:
Run pytest -q tests/unit/runtime/test_precision_config_loss_scale.py
Result:
```
root@72170d0458e9:/home/DeepSpeed_woo# pytest -q tests/unit/runtime/test_precision_config_loss_scale.py
=================================================================== test session starts ===================================================================
platform linux -- Python 3.11.10, pytest-8.3.5, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
Using --randomly-seed=1526199052
rootdir: /home/DeepSpeed_woo/tests
configfile: pytest.ini
plugins: xdist-3.8.0, randomly-4.0.1, forked-1.6.0, anyio-4.6.0
collected 10 items
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[3] PASSED [ 10%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[0] PASSED [ 20%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[inf] PASSED [ 30%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[1] PASSED [ 40%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[nan] PASSED [ 50%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[2.0] PASSED [ 60%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[True] PASSED [ 70%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale0] PASSED [ 80%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[-1] PASSED [ 90%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale1] PASSED [100%]
(30 durations < 1s hidden. Use -vv to show these durations.)
============================================================= 10 passed, 16 warnings in 4.18s =============================================================
```
Fixes #7852
---------
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>