Never reuse accumulated gradients' buffers (#119334)
Since AccumulateGrad may steal the gradient's `c10::Storage`, we can't reuse the gradient's buffer for another op; otherwise the accumulated gradient will get overwritten. From benchmarks, allocating a new buffer with inductor's codegen'd `_empty_strided_cpu/cuda` and assigning into it has lower overhead than deep copying the gradient and reusing its buffer.
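A minimal sketch of the hazard in plain eager-mode Python, not the actual AccumulateGrad or inductor-generated wrapper code (the `param`, `grad_buf`, and `scratch` names are illustrative, and `torch.empty_strided` stands in for the codegen'd `_empty_strided_cpu/cuda` helpers): once `param.grad` aliases the gradient buffer's storage, reusing that buffer for a later op clobbers the accumulated gradient, whereas writing into a freshly allocated buffer leaves it intact.

```python
import torch

# Illustrative sketch only; names and allocation call are stand-ins.
param = torch.zeros(4, requires_grad=True)
grad_buf = torch.ones(4)   # gradient buffer produced by the compiled backward
param.grad = grad_buf      # AccumulateGrad's fast path can keep this storage
                           # by aliasing it rather than copying it

# Hazard: reusing grad_buf as the output of a later op overwrites param.grad.
grad_buf.fill_(42.0)
assert param.grad[0].item() == 42.0   # accumulated gradient is corrupted

# Fix pattern (sketched with torch.empty_strided): give the later op a fresh
# buffer instead of reusing the gradient's buffer.
param.grad = torch.ones(4)            # reset for the sketch
scratch = torch.empty_strided(param.grad.size(), param.grad.stride())
scratch.fill_(42.0)
assert param.grad[0].item() == 1.0    # accumulated gradient is untouched
```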
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119334
Approved by: https://github.com/jansel
ghstack dependencies: #118817