[Reland] Don't clone args (#89766)
Reland of https://github.com/pytorch/pytorch/pull/89519.
Improves first memory compression on pytorch struct from 0.55 to 0.73. However, it doesn't fully eliminate the autotuning overhead, because of the 250 MB cache clearing in Triton benchmarking.
Relanded because the previous version didn't account for in-place buffer reuse correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89766
Approved by: https://github.com/jansel