[StaticRuntime] Permute_out (#49447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49447
Add an out variant for `permute`. This is preferable to fixing the copy inside `contiguous` because 1) the out variant can leverage the c2 math library, and 2) `contiguous` creates a tensor inside the function, which is not managed by the MemoryPlanner in StaticRuntime.
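For illustration only, a minimal standalone sketch of the out-variant idea: the caller owns the output buffer and the function only writes into it. This is not the code in this change. `permute_out` here is a hypothetical helper on plain `std::vector<float>` buffers, it assumes a row-major contiguous input, and it uses a naive index walk rather than the c2 math library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not the StaticRuntime implementation.
// Writes the permuted, contiguous copy of `in` into the caller-provided `out`.
//   in_shape: sizes of the row-major contiguous input
//   dims:     output dimension i takes input dimension dims[i]
void permute_out(const std::vector<float>& in,
                 const std::vector<std::size_t>& in_shape,
                 const std::vector<std::size_t>& dims,
                 std::vector<float>& out) {
  const std::size_t ndim = in_shape.size();
  assert(dims.size() == ndim);
  // The output must already be sized by the caller (in the real runtime, a
  // memory planner would own it); the result buffer is never allocated here.
  assert(out.size() == in.size());

  // Row-major strides of the input.
  std::vector<std::size_t> in_strides(ndim, 1);
  for (std::size_t i = ndim; i-- > 1;) {
    in_strides[i - 1] = in_strides[i] * in_shape[i];
  }

  // Output sizes, plus the input stride to follow along each output dim.
  std::vector<std::size_t> out_shape(ndim), src_strides(ndim);
  for (std::size_t i = 0; i < ndim; ++i) {
    out_shape[i] = in_shape[dims[i]];
    src_strides[i] = in_strides[dims[i]];
  }

  // Walk the output linearly while tracking the matching input offset.
  std::vector<std::size_t> idx(ndim, 0);
  std::size_t src = 0;
  for (std::size_t dst = 0; dst < out.size(); ++dst) {
    out[dst] = in[src];
    // Odometer-style increment, last output dimension fastest.
    for (std::size_t d = ndim; d-- > 0;) {
      src += src_strides[d];
      if (++idx[d] < out_shape[d]) break;
      src -= src_strides[d] * out_shape[d];  // wrap this dim and carry
      idx[d] = 0;
    }
  }
}
```

Because the result lands in a buffer the caller already holds, repeated calls can reuse the same storage, which is the property that lets a memory planner manage it.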
Test Plan:
Benchmark (per-iteration latency drops from 0.0962 ms to 0.0902 ms, about 6%):
```
After:
I1214 12:35:32.218775 991920 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0902339. Iters per second: 11082.3
Before:
I1214 12:35:43.368770 992620 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0961521. Iters per second: 10400.2
```
Reviewed By: yinghai
Differential Revision: D25541666
fbshipit-source-id: 013ed0d4080cd01de4d3e1b031ab51e5032e6651