Add assertion on any NaN error on the error feedback (#49374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49374
After the assertion is added, the NaN error in certain training runs disappears.
The real error appears to be caused by an underlying illegal memory access; the assertion is a temporary workaround that surfaces the NaN as early as possible.
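A minimal sketch of the kind of NaN assertion described above (the helper name `assert_no_nan` and the standalone usage are illustrative, not the actual hook code in `torch.distributed.algorithms.ddp_comm_hooks`):

```python
import torch

def assert_no_nan(tensor: torch.Tensor, name: str = "error_feedback") -> None:
    # Fail fast if any NaN appears in the tensor, rather than letting it
    # propagate silently through gradient compression.
    assert not torch.isnan(tensor).any(), f"NaN detected in {name}"

# A finite tensor passes the check.
assert_no_nan(torch.tensor([1.0, 2.0, 3.0]))
```

In the hook itself, such a check would run on the error-feedback tensor each iteration, so a NaN (or the memory corruption producing it) is caught at the step where it first appears.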
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118572471
Test Plan:
Real run on Ads 10X model: scripts/wayi/mast_prof_gradient_compression.sh POWER_SGD 8
To reproduce the error, comment out the assertion.
Reviewed By: rohan-varma
Differential Revision: D25548299
fbshipit-source-id: 039af7d94a27e0f47ef647c6163fd0e5064951d5