[caffe2] make order between div and mul in adagrad consistent (#32974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32974
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/286
Re-attempt of D18805426. Decided to be consistent with PyTorch Adagrad.
There was an inconsistency in the order of operations between the scalar and SIMD code paths when computing Adagrad. This diff makes them consistent by computing w += lr * grad / (sqrt(moment) + epsilon) in Adagrad and w += lr / (sqrt(moment) + epsilon) * grad in RowWiseSparseAdagrad.
The Adagrad order matches PyTorch (see the addcmul_cpu_kernel function in aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp). The RowWiseSparseAdagrad order makes the computation more efficient: lr / (sqrt(moment) + epsilon) is shared among all elements in a row, so it only needs to be computed once per row.
We also do not use FMA, to stay consistent with PyTorch, even though FMA would provide a small accuracy benefit.
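The two operation orders described above can be sketched in plain Python. This is an illustration only, not the caffe2 or FBGEMM kernels: the function names adagrad_update and rowwise_adagrad_update are hypothetical, the moment values are assumed to be already updated for the current step, and the sign convention follows the formulas as written in this summary.

```python
import math


def adagrad_update(w, grad, moment, lr, epsilon):
    # Dense Adagrad order: (lr * grad) first, then divide, per element.
    # Python evaluates lr * g / d left to right, i.e. (lr * g) / d.
    return [
        wi + lr * gi / (math.sqrt(mi) + epsilon)
        for wi, gi, mi in zip(w, grad, moment)
    ]


def rowwise_adagrad_update(row, grad_row, row_moment, lr, epsilon):
    # Row-wise order: one division per row, shared by all elements,
    # followed by one multiply per element.
    step = lr / (math.sqrt(row_moment) + epsilon)
    return [wi + step * gi for wi, gi in zip(row, grad_row)]
```

The two orders are mathematically equivalent but can differ in the last bits of a floating-point result, which is why the scalar and SIMD paths need to agree on one order for each operator.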
Test Plan: CI
Reviewed By: wx1988
Differential Revision: D19342865
fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341