Don't clone unmutated args in triton autotuning (#89519)
Improves first-iteration memory compression on pytorch struct from .55 -> .73. However, it doesn't totally eliminate the overhead from autotuning. Any other pointers on where the remaining autotuning overhead is coming from would be great.
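The idea behind the change can be sketched as follows. Autotuning benchmarks a kernel under several configs, so any arg the kernel mutates must be copied first to keep the inputs pristine between runs; previously every arg was cloned, which inflated peak memory. This is a minimal, dependency-free sketch (plain lists stand in for tensors, `copy.deepcopy` for `Tensor.clone`; the `autotune`/`mutated_indices` names are hypothetical, not the actual inductor API):

```python
import copy
import time

def autotune(kernel, args, mutated_indices, configs):
    """Pick the fastest config, cloning only the args the kernel mutates.

    Hypothetical sketch: `kernel` takes (*args, **config), and
    `mutated_indices` holds the positions of args the kernel writes to.
    Read-only args are passed through without a copy, which is the
    memory saving this PR targets.
    """
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        # Fresh copies only for mutated args; everything else is shared.
        call_args = [
            copy.deepcopy(a) if i in mutated_indices else a
            for i, a in enumerate(args)
        ]
        start = time.perf_counter()
        kernel(*call_args, **cfg)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg

# Toy "kernel" that writes a scaled copy of x into out, in place.
def scale_kernel(x, out, factor=1):
    for i, v in enumerate(x):
        out[i] = v * factor

x = [1, 2, 3]
out = [0, 0, 0]
best = autotune(
    scale_kernel,
    [x, out],
    mutated_indices={1},           # only `out` is written to
    configs=[{"factor": 2}, {"factor": 3}],
)
# `out` is untouched because each benchmark run wrote to a clone,
# and `x` was never copied at all.
print(out)  # [0, 0, 0]
print(x)    # [1, 2, 3]
```

In the real autotuner the mutated-arg set comes from the compiler's own analysis of the kernel; the point of the sketch is just that the clone is restricted to that set instead of applied to every arg.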
Edit: I think it's just the triton cache clearing: https://github.com/openai/triton/blob/44f577984d28ee979f704e2c28a1dcbac9639840/python/triton/testing.py#L159
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89519
Approved by: https://github.com/ngimel, https://github.com/jansel