Port CPU Trace from TH to ATen (#47126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47126
Context
-------
This PR is a rebase of shihongzhi's https://github.com/pytorch/pytorch/pull/35360.
I forgot to merge it when it was originally submitted, so I rebased it and ran new benchmarks on it.
Benchmarks
----------
TL;DR: The ATen port has slightly more per-call overhead than the TH version, but the overhead disappears for larger shapes.
```
import torch

shapes = [
    [1, 1],
    [100, 100],
    [1000, 1000],
    [10000, 10000],
    [100000, 100000],
]
for shape in shapes:
    x = torch.ones(shape)
    %timeit x.trace()
```
Before:
```
1.83 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.98 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.19 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
85.2 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.23 ms ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After:
```
2.16 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
2.08 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.45 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.8 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.27 ms ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
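For context, the trace of a 2-D tensor is just the sum of its main diagonal, so the port can be sanity-checked against a one-line reference. The sketch below is illustrative only, not the actual ATen kernel; `trace_reference` is a hypothetical helper name:
```
import torch

def trace_reference(x):
    # Hypothetical reference used only for illustration: the trace of a
    # 2-D tensor is the sum of the elements on its main diagonal.
    assert x.dim() == 2
    return x.diagonal().sum()

x = torch.randn(100, 100)
# The ported op should agree with the reference on floating-point inputs.
assert torch.allclose(x.trace(), trace_reference(x))
```
Since `diagonal()` returns a view, this reference reads only the diagonal elements and materializes nothing beyond the 0-dim result.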
Future work
-----------
Things that can be done after this PR:
- Add complex tensor support.
- Fix the type promotion discrepancy between the CPU and CUDA implementations (a probe is sketched after this list).
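As a hedged sketch of how the discrepancy could be observed, the snippet below prints the result dtype of `trace` on each device for an integral input; it does not assert which device differs, and a CUDA device is treated as optional:
```
import torch

# Probe the result dtype of trace on each available device.
x = torch.ones(3, 3, dtype=torch.int32)
print("cpu :", x.trace().dtype)
if torch.cuda.is_available():
    print("cuda:", x.cuda().trace().dtype)
```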
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D24683259
Pulled By: zou3519
fbshipit-source-id: f92b566ad0d58b72663ab64899d209c96edb78eb