`FusedAdam(W)` should take `OptState` into account before unscaling grads (#94060)
The fused optimizers have to consult `OptState` before unscaling gradients, because `GradScaler.unscale_` may already have been called explicitly, e.g. to run `clip_grad_norm_` on unscaled gradients, as described in https://github.com/pytorch/pytorch/blob/e52786f3d177a7ca5d490a516cf52e236ef072cb/torch/cuda/amp/grad_scaler.py#L235-L266 and https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients
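For context, a minimal sketch of the usage pattern from the linked AMP docs (the `data` iterable, model, and loss are placeholders): the explicit `scaler.unscale_(optimizer)` means the later `scaler.step(optimizer)` must check `OptState` so the fused step does not unscale the gradients a second time.

```python
import torch

model = torch.nn.Linear(10, 10).cuda()
# fused=True selects the fused Adam path whose step must respect OptState
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
scaler = torch.cuda.amp.GradScaler()

for inputs, targets in data:  # placeholder data loader
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()

    # Unscale explicitly so clip_grad_norm_ sees unscaled gradients.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # step() must see (via OptState) that unscale_ already ran and skip
    # unscaling again before the fused optimizer update.
    scaler.step(optimizer)
    scaler.update()
```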
Related #90752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94060
Approved by: https://github.com/albanD