[aot autograd] avoid cloning some inputs unnecessarily when they don't require grad (#96342)
When constructing the joint graph, we normally have to clone any inputs that are mutated, so that we can pass in the original, pre-mutation inputs as leaves to autograd.
Previously, we did this for all mutated inputs, but we only need to do it for inputs that require gradients and therefore participate in autograd.
Hopefully this speeds up code like batch norm: before this change, we were unnecessarily cloning the running stats (buffers that don't require grad) during training.
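The idea can be sketched as follows. This is a minimal illustration, not the actual AOTAutograd code path; `prepare_inputs_for_joint_graph` and `mutated_indices` are hypothetical names introduced here for clarity:

```python
import torch

def prepare_inputs_for_joint_graph(inputs, mutated_indices):
    # Hypothetical helper illustrating the change: clone only those mutated
    # inputs that require grad, so autograd sees the pre-mutation values as
    # leaves. Mutated inputs that don't require grad (e.g. batch norm
    # running stats) are passed through without a clone.
    prepared = []
    for i, t in enumerate(inputs):
        if i in mutated_indices and t.requires_grad:
            prepared.append(t.clone())
        else:
            prepared.append(t)
    return prepared

weight = torch.randn(3, requires_grad=True)   # mutated, requires grad
running_mean = torch.zeros(3)                 # mutated buffer, no grad
out = prepare_inputs_for_joint_graph([weight, running_mean],
                                     mutated_indices={0, 1})
assert out[0] is not weight        # cloned: participates in autograd
assert out[1] is running_mean      # not cloned: no grad, clone is wasted work
```

Before this PR, both entries would have been cloned; afterwards only the first is.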
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96342
Approved by: https://github.com/albanD, https://github.com/ezyang