[RELAND] Change AccumulateGrad to yield `.grad`s that match weights' memory layout (#40129)
Summary:
https://github.com/pytorch/pytorch/pull/34904 was reverted because it included a misconfigured 4-GPU test that external CI failed to catch ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).
This PR reverts the revert and adds changes that should repair the misconfigured test.
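
For context, the behavior named in the title can be illustrated with a minimal sketch. It assumes a PyTorch build that includes this change and a weight whose memory is non-overlapping and dense. The tensor shape, the `channels_last` format, and the helper name `grad_strides` are illustrative choices, not taken from the PR.

```python
import torch


def grad_strides(memory_format=torch.channels_last):
    """Return (weight strides, grad strides) after one backward pass.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    # Leaf weight stored in a non-default (channels_last) layout.
    weight = (
        torch.randn(2, 3, 4, 5)
        .contiguous(memory_format=memory_format)
        .requires_grad_()
    )
    # The other operand uses the default row-major contiguous layout.
    upstream = torch.randn(2, 3, 4, 5)
    (weight * upstream).sum().backward()
    return weight.stride(), weight.grad.stride()


w_stride, g_stride = grad_strides()
# With this change, the two stride tuples are expected to be equal.
print(w_stride, g_stride)
```

If the weight's memory were overlapping or not dense, the strides would not be expected to match, so this sketch covers only the dense case.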
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40129
Differential Revision: D22079377
Pulled By: albanD
fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1