1. Allow has_overflow to be called without resetting gpu_sum.
2. Clone the param grad before putting it in the swap-out gradient buffer, since the underlying param.grad buffer is reused.
3. Do not return -1 for the norm when the total norm is NaN or Inf; return the computed value instead.
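
The aliasing problem behind item 2 can be sketched in plain Python. This is only an illustration, not the actual code from this change: `Param`, `write_grad`, and `stash_grads` are hypothetical names, and a stdlib `array` stands in for the reused param.grad tensor buffer.

```python
from array import array


class Param:
    """Hypothetical stand-in for a parameter whose grad buffer is reused."""

    def __init__(self, size):
        self.grad = array("f", [0.0] * size)

    def write_grad(self, values):
        # Overwrite the same underlying buffer in place, as a reused
        # gradient buffer would be on each backward pass.
        for i, v in enumerate(values):
            self.grad[i] = v


def stash_grads(param, steps, clone):
    """Collect the gradient after each step into a swap-out style buffer."""
    swap_out_buffer = []
    for values in steps:
        param.write_grad(values)
        # With clone=True we copy the data; otherwise we only store a
        # reference to the buffer that the next step will overwrite.
        grad = array("f", param.grad) if clone else param.grad
        swap_out_buffer.append(grad)
    return [list(g) for g in swap_out_buffer]


steps = [[1.0, 2.0], [3.0, 4.0]]
aliased = stash_grads(Param(2), steps, clone=False)
cloned = stash_grads(Param(2), steps, clone=True)

print(aliased)  # every entry shows the last gradient written
print(cloned)   # each entry keeps the gradient from its own step
```

Without the copy, every stored entry points at the same buffer, so earlier gradients are silently replaced by the most recent one. Copying before storing costs extra memory but keeps each entry independent.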