[Gradient Compression] Divide by world size before all_reduce to avoid overflow (#57410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57410
FP16 gradient compression may run into an 'inf' issue: summing FP16 gradients across ranks during all_reduce can overflow the FP16 range. Switching to dividing by the world size before the all_reduce avoids this problem, since each rank's contribution is scaled down before the summation.
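For context, here is a minimal sketch of the hook pattern this change adopts. It assumes the DDP comm-hook API of this era (`bucket.get_tensor()`, `Future.then`); it is illustrative, not the exact diff:

```python
import torch
import torch.distributed as dist

def fp16_compress_hook(process_group, bucket):
    group_to_use = process_group if process_group is not None else dist.group.WORLD
    world_size = group_to_use.size()

    # Divide by world_size BEFORE the all_reduce, so the summed FP16
    # values stay in range instead of overflowing to 'inf'.
    compressed = bucket.get_tensor().to(torch.float16).div_(world_size)

    fut = dist.all_reduce(compressed, group=group_to_use, async_op=True).get_future()

    def decompress(fut):
        # No post-division needed: pre-dividing already yields the average.
        tensor = bucket.get_tensor()
        tensor.copy_(fut.value()[0])
        return [tensor]

    return fut.then(decompress)
```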
ghstack-source-id: 127877083
Test Plan:
before change:
f268909897
after change:
f270950609
If you still see 'grad_norm = inf' after enabling the FP16 hook, you can resume the training with the hook turned off.
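For reference, enabling the built-in hook looks like the sketch below, using the public `register_comm_hook` API; `ddp_model` is a placeholder for your DDP-wrapped module:

```python
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# ddp_model is a torch.nn.parallel.DistributedDataParallel instance.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```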
Reviewed By: SciPioneer
Differential Revision: D28128628
fbshipit-source-id: 0b6648637713e4f321e39c9ccb645a6b6f1750a0