Fix: Update grad norm calculation for CPU offload (#7302)
## Description
This PR fixes an issue where gradient clipping modifications are not
reflected in the global gradient norm calculation when CPU offloading is
enabled. The issue occurs because the `averaged_gradients` are not being
updated with the clipped gradients when CPU offloading is active.
## Problem
When using CPU offloading with gradient clipping:
1. The gradients are successfully clipped using `safe_set_local_grad`.
2. However, the `_global_grad_norm` calculation still uses the original,
unclipped gradients.
3. This leads to incorrect gradient norm reporting and can undermine the
effectiveness of gradient clipping.
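The stale-cache behavior above can be illustrated with a toy sketch. The names here (`local_grads`, `averaged_gradients`, `global_norm`) are illustrative stand-ins, not DeepSpeed's actual internals: clipping mutates one copy of the gradients while the norm is computed from a cached copy that is never refreshed.

```python
import math

def global_norm(grads):
    """L2 norm over a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

local_grads = [3.0, 4.0]                 # gradients modified via clipping
averaged_gradients = list(local_grads)   # cached copy used for the norm

# Clip the local gradients to max-norm 1.0 ...
clip = 1.0
scale = clip / max(global_norm(local_grads), clip)
local_grads = [g * scale for g in local_grads]

# ... but the cached copy is never refreshed (the CPU-offload case):
stale_norm = global_norm(averaged_gradients)  # still 5.0, the pre-clip value
true_norm = global_norm(local_grads)          # 1.0 after clipping
```

The reported norm (`stale_norm`) is 5.0 even though the gradients actually applied have norm 1.0, which is the mismatch this PR addresses.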
## Solution
The fix ensures that the `averaged_gradients` are properly updated with
the clipped gradients when CPU offloading is enabled, similar to how it
works when CPU offloading is disabled.
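Continuing the toy sketch from the Problem section (again with illustrative names, not DeepSpeed's real API), the fix amounts to writing the clipped values back into the cached copy before the norm is computed:

```python
import math

def global_norm(grads):
    """L2 norm over a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def clip_and_sync(local_grads, averaged_gradients, clip):
    """Clip gradients to max-norm `clip`, then refresh the cached copy."""
    scale = clip / max(global_norm(local_grads), clip)
    clipped = [g * scale for g in local_grads]
    averaged_gradients[:] = clipped  # the fix: keep the cache in sync
    return clipped

local_grads = [3.0, 4.0]
averaged_gradients = list(local_grads)
local_grads = clip_and_sync(local_grads, averaged_gradients, clip=1.0)

# Cached copy and live gradients now agree, so the reported norm is correct.
reported_norm = global_norm(averaged_gradients)
```

With the cache synchronized, the norm computed from `averaged_gradients` matches the clipped gradients, mirroring the behavior when CPU offloading is disabled.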
## Testing
The fix has been tested with:
- CPU offloading enabled and disabled
- Different gradient clipping values
- A simple model with linear layers
- Both FP16 and BF16
## Related Issues
Fixes #7292
---------
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>