[FSDP] Avoided CPU sync in `clip_grad_norm_` (#122001)
Copying a scalar 0 tensor from CPU to GPU, or constructing a scalar 0 tensor on the GPU, requires a CPU sync with the GPU, so we avoid ops that involve such a tensor.
`FSDP.clip_grad_norm_` already checks first whether all parameters are unsharded and, if so, calls into `nn.utils.clip_grad_norm_`, so at the point of the code changes there are guaranteed to be some sharded parameters.
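As a rough illustration of the pattern being avoided (this is a minimal sketch, not the actual FSDP code; the helper name and structure are assumptions), the idea is to build the local gradient norm only from tensors that already live on the device, so no scalar 0 tensor has to be copied to or constructed on the GPU:

```python
import torch

def _local_sharded_grad_norm(grads, norm_type: float) -> torch.Tensor:
    # Pattern to avoid (can trigger a CPU/GPU sync):
    #     total = torch.tensor(0.0, device=grads[0].device)  # scalar 0 materialized on GPU
    #     for g in grads:
    #         total = total + torch.linalg.vector_norm(g, norm_type) ** norm_type
    #
    # Preferred: compute per-gradient norms and combine them directly, using only
    # tensors already on the device. Assumes `grads` is non-empty, which matches
    # the guarantee above that some sharded parameters exist at this point.
    norms = [torch.linalg.vector_norm(g, norm_type) for g in grads]
    return torch.linalg.vector_norm(torch.stack(norms), norm_type)
```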
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122001
Approved by: https://github.com/wanchaol