[Autograd] Use in-place input accumulation fast path for dense Tensors. (#88339)
There is a fast path in InputBuffer that steals the incoming Tensor's memory when its use count is zero, but it is currently used only for sparse Tensors. According to Natalia, this is only because it wasn't obvious that dense Tensors would benefit, so there was no reason to live dangerously. However, I've noticed large Tensors in internal models that would benefit from this optimization as well.
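To illustrate the idea (not PyTorch's actual C++ implementation — the class, field names, and the explicit `use_count` bookkeeping below are all hypothetical), here is a toy sketch of refcount-gated in-place accumulation: when no one else holds a reference to the incoming gradient, its buffer can be reused for the sum instead of allocating a fresh tensor.

```python
class Tensor:
    """Toy stand-in for a dense tensor: data plus an explicit use count.

    `use_count` models how many *other* owners hold this tensor; 0 means
    it is a temporary whose memory is safe to steal.
    """
    def __init__(self, data, use_count=0):
        self.data = list(data)
        self.use_count = use_count


def accumulate(slot, incoming):
    """Accumulate `incoming` into the buffer `slot`, stealing when safe.

    Sketches the InputBuffer fast path described in the commit: if the
    incoming gradient has no other owners, add into it in place and hand
    its buffer back; otherwise fall back to an out-of-place sum.
    """
    if slot is None:
        # First gradient for this input: just take ownership of it.
        return incoming
    if incoming.use_count == 0:
        # Fast path: nobody else can observe `incoming`, so mutate it.
        for i, v in enumerate(slot.data):
            incoming.data[i] += v
        return incoming
    # Slow path: `incoming` is shared, so allocate a new result tensor.
    return Tensor(a + b for a, b in zip(slot.data, incoming.data))
```

A temporary gradient (`use_count == 0`) comes back as the very same object, i.e. no new allocation, while a shared one forces a copy:

```python
slot = accumulate(None, Tensor([1, 2, 3]))
g = Tensor([10, 20, 30])          # temporary: eligible for stealing
out = accumulate(slot, g)
assert out is g                   # buffer was stolen
shared = Tensor([1, 1, 1], use_count=2)
out2 = accumulate(out, shared)
assert out2 is not shared         # shared input left untouched
```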
Differential Revision: [D40946601](https://our.internmc.facebook.com/intern/diff/D40946601/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88339
Approved by: https://github.com/ngimel