Back out "`GradScaler` recomputes `optimizer_state["found_inf_per_device"]` before `optimizer.step` (#97415)" (#98613)
Summary: The original change (#97415) causes a multi-GPU job from the XI team to hang after 8K steps.
Differential Revision: D44797248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98613
Approved by: https://github.com/ngimel