DeepSpeed
61daaa1e - Optimize grad_norm calculations by reducing device/host dependency (#4974)

Commit
1 year ago
Optimize grad_norm calculations by reducing device/host dependency (#4974)

Device tensors are being cast to Python variables, which requires device/host transfers, while all calculations can be done on the device using PyTorch native operations.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
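
To illustrate the pattern the commit message describes, here is a minimal sketch. It is not the code from this commit: the function names (`grad_norm_with_host_sync`, `grad_norm_on_device`, `clip_coefficient`) and the `eps` default are hypothetical and chosen only for illustration. The first function accumulates the norm in a Python float through `.item()`, which copies each value to the host; the second keeps every intermediate result as a tensor.

```python
import torch


def grad_norm_with_host_sync(grads, norm_type=2.0):
    """Hypothetical 'before' version: accumulates in a Python float.

    Each .item() call copies a scalar from the device to the host.
    """
    total = 0.0
    for g in grads:
        total += g.norm(norm_type).item() ** norm_type
    return total ** (1.0 / norm_type)


def grad_norm_on_device(grads, norm_type=2.0):
    """Hypothetical 'after' version: the result stays a tensor.

    No value is converted to a Python number, so no host copy is forced.
    """
    per_tensor = torch.stack([g.norm(norm_type) for g in grads])
    return per_tensor.pow(norm_type).sum().pow(1.0 / norm_type)


def clip_coefficient(total_norm, max_norm, eps=1e-6):
    """Clip factor via tensor ops instead of a Python `if` on a float."""
    return torch.clamp(max_norm / (total_norm + eps), max=1.0)


if __name__ == "__main__":
    grads = [torch.tensor([3.0, 0.0]), torch.tensor([0.0, 4.0])]
    norm = grad_norm_on_device(grads)
    print(norm, clip_coefficient(norm, max_norm=1.0))
```

Under these assumptions, the gradients can then be scaled with `g.mul_(clip_coefficient(norm, max_norm))`, so the clipping decision never has to be evaluated on the host.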