Optimize grad_norm calculations by reducing device/host dependency (#4974)
Device tensors are being cast to Python variables, which forces
device-to-host transfers, even though all of the calculations can
be done on the device using native PyTorch operations.
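
For illustration only, a minimal sketch of the two patterns. This is not the
code changed in this commit; the function names `grad_norm_host` and
`grad_norm_device` are hypothetical, and the inputs are assumed to be a list
of gradient tensors on the same device.

```python
import torch


def grad_norm_host(grads):
    # Host-dependent pattern: each .item() copies a scalar to the host,
    # which makes the host wait for the device on every tensor.
    total = 0.0
    for g in grads:
        total += torch.linalg.vector_norm(g.float(), ord=2).item() ** 2
    return total ** 0.5  # Python float


def grad_norm_device(grads):
    # Device-only pattern: per-tensor norms are stacked and reduced with
    # tensor operations, so the result stays a 0-dim tensor on the device.
    norms = torch.stack([torch.linalg.vector_norm(g.float(), ord=2) for g in grads])
    return torch.linalg.vector_norm(norms, ord=2)


grads = [torch.tensor([3.0]), torch.tensor([4.0])]
host_norm = grad_norm_host(grads)      # Python float
device_norm = grad_norm_device(grads)  # torch.Tensor, no host copy yet
```

In the device-only version the value is copied to the host only if the caller
later converts it explicitly, for example for logging.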
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>