Make the order of division and multiplication in the Adagrad update consistent (#30449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30449
There was an inconsistency in the order of operations between the scalar and SIMD code when computing the Adagrad update.
In this diff we first compute effective_lr = lr / (sqrt(moment) + epsilon) and then multiply it by the gradient.
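
The order matters because floating-point multiplication and division do not commute exactly under rounding, so (lr * g) / d and (lr / d) * g can differ in the last bits. A minimal sketch of the two orderings follows. It is plain Python rather than the actual kernel code, and the function names, the `w - ...` sign convention, and the use of the updated moment in the denominator are assumptions made for illustration only.

```python
import math


def adagrad_step_grad_first(w, g, moment, lr, epsilon):
    """Hypothetical ordering: multiply lr by the gradient, then divide."""
    new_moment = moment + g * g
    denom = math.sqrt(new_moment) + epsilon
    new_w = w - (lr * g) / denom
    return new_w, new_moment


def adagrad_step_effective_lr(w, g, moment, lr, epsilon):
    """Ordering described in this diff: compute effective_lr, then multiply."""
    new_moment = moment + g * g
    effective_lr = lr / (math.sqrt(new_moment) + epsilon)
    new_w = w - effective_lr * g
    return new_w, new_moment


if __name__ == "__main__":
    args = dict(w=0.3, g=0.7, moment=0.11, lr=0.01, epsilon=1e-5)
    a, _ = adagrad_step_grad_first(**args)
    b, _ = adagrad_step_effective_lr(**args)
    # The two results agree to within rounding error but are not
    # guaranteed to be bit-identical.
    print(a, b, a == b)
```

If the scalar and SIMD paths use the same ordering, results should no longer depend on which path handles a given element, at least as far as this step is concerned.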
Test Plan: CI
Reviewed By: protonu
Differential Revision: D18703416
fbshipit-source-id: 2a8b2a3f5401466549561412bd22f07abac3c598