[inductor] fix benchmark call for inplace update (#103547)
Enabling coordinate descent tuning for a few models causes an illegal memory access (or triggers a device assert before that). Command:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 python benchmarks/dynamo/huggingface.py --amp --performance --training --inductor -d cuda --only CamemBert
```
It turns out that we cannot benchmark this kernel: https://gist.github.com/shunting314/a78997f54b5751f2887f4576956036ce
Digging further shows that this kernel has an inplace argument that is modified when the kernel runs. Our benchmark API simply calls a kernel multiple times. Since each run may have side effects, earlier calls can change the inplace argument in a way that makes later calls fail.
This PR clones those inplace arguments before each benchmark call. This can increase the time for each benchmark call, but it should not affect autotuning, since the same amount of time is added for every tuning config.
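A minimal sketch of the idea, in plain Python rather than the actual inductor code. The names `inplace_kernel`, `benchmark`, and `inplace_arg_positions` are hypothetical and only illustrate the pattern; a list stands in for a tensor and `copy.deepcopy` stands in for a tensor clone:

```python
import copy
import time


def inplace_kernel(index_buf, table):
    # Hypothetical kernel: reads table[index_buf[0]], then updates index_buf
    # in place. A second call on the same buffer indexes out of bounds,
    # which stands in for the illegal memory access on the GPU.
    value = table[index_buf[0]]
    index_buf[0] += len(table)
    return value


def benchmark(kernel, args, inplace_arg_positions, repeat=5):
    """Time `kernel`, giving every call fresh copies of its inplace arguments."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        # The copy is inside the timed region, so each call gets slower by
        # the same amount regardless of which config is being tuned.
        call_args = [
            copy.deepcopy(arg) if pos in inplace_arg_positions else arg
            for pos, arg in enumerate(args)
        ]
        kernel(*call_args)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    args = [[1], [10, 20, 30]]
    # Position 0 is the inplace argument, so it is copied before each call.
    benchmark(inplace_kernel, args, inplace_arg_positions={0})
    # The caller's buffer is untouched, so repeated calls stay valid.
    print(args[0])
```

Passing an empty `inplace_arg_positions` in this sketch reproduces the original failure mode: the first call succeeds and the second raises because the buffer was already updated.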
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103547
Approved by: https://github.com/jansel