DeepSpeed
61daaa1e - Optimize grad_norm calculations by reducing device/host dependency (#4974)

Commit

2 years ago

Optimize grad_norm calculations by reducing device/host dependency (#4974)

Device tensors were being cast to Python variables, which requires device-to-host transfers, even though all of the calculations can be done on the device using native PyTorch operations.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
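
The pattern described in the commit message can be sketched roughly as follows. This is an illustration only, not the code changed in this commit: the function names `grad_norm_host_sync`, `grad_norm_on_device`, and `clip_coef_on_device` are hypothetical, and the sketch assumes a non-empty list of gradient tensors and a plain L2 norm.

```python
import torch


def grad_norm_host_sync(grads):
    """Host-dependent variant: every .item() copies a scalar to the host."""
    total = 0.0
    for g in grads:
        # .item() forces a device-to-host transfer (and a sync on GPU).
        total += g.float().norm(2).item() ** 2
    return total ** 0.5  # Python float


def grad_norm_on_device(grads):
    """Device-only variant: the result stays a 0-dim tensor on the device."""
    per_tensor = torch.stack([g.float().norm(2) for g in grads])
    return per_tensor.norm(2)


def clip_coef_on_device(total_norm, max_norm, eps=1e-6):
    """Clipping coefficient without a Python-side `if total_norm > max_norm`.

    Branching on a tensor value in Python would also pull it to the host;
    clamping the ratio to at most 1.0 gives the same result on the device.
    """
    return torch.clamp(max_norm / (total_norm + eps), max=1.0)


if __name__ == "__main__":
    grads = [torch.ones(4), torch.full((3,), 2.0)]
    norm = grad_norm_on_device(grads)
    coef = clip_coef_on_device(norm, max_norm=1.0)
    for g in grads:
        g.mul_(coef)  # scale in place, still no host round trip
    print(norm, coef)
```

Both variants compute the same value; the difference is that the second never converts a device tensor into a Python number, so the host does not have to wait for the device before queuing further work.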

References

#4974 - Optimize grad_norm calculations by reducing device/host dependency

Author

nelyahu

Parents

19e0dc39
