Refine gradient accumulation (on-device training) (#12363)
* a
(cherry picked from commit 43909cdd6e3daf30a82d584292286806d1172a0b)
* optimize inplace accumulator a bit
* fix inputs
* revert logging
* minor fix
* tune perf and resolve comments
* typo
* fix
* fix tests
* move threshold to constexpr