Simplify copy kernel (#28428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28428
Using the new type promotion and dynamic casting support added to
`TensorIterator`, the copy kernels can be greatly simplified.
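A minimal CPU-runnable sketch of the dtype-converting copy being benchmarked (tensor names here are illustrative, not from the patch):

```python
import torch

# Tensor.to performs a dtype-converting copy; this is the code path
# the CUDA benchmark below exercises (shown on CPU for illustration).
a = torch.arange(4, dtype=torch.float32)
b = a.to(torch.int64)
print(b.dtype)  # torch.int64

# Type promotion: result_type picks the common dtype for mixed inputs.
print(torch.result_type(a, torch.tensor(1, dtype=torch.int64)))  # torch.float32
```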
Benchmark on CUDA:
```python
import torch
import timeit
import pandas
import itertools
from tqdm.notebook import tqdm
import math
print(torch.__version__)
print()
_10M = 10 * 1024 ** 2
d = {}
for from_, to in tqdm(itertools.product(torch.testing.get_all_dtypes(), repeat=2)):
    if from_ not in d:
        d[from_] = {}
    a = torch.empty(_10M, dtype=from_, device='cuda')
    min_ = math.inf
    for i in range(100):
        torch.cuda.synchronize()
        start = timeit.default_timer()
        a.to(to)
        torch.cuda.synchronize()
        end = timeit.default_timer()
        elapsed = end - start
        if elapsed < min_:
            min_ = elapsed
    d[from_][to] = int(min_ * 1000 * 1000)
pandas.DataFrame(d)
```
original:
![image](https://user-images.githubusercontent.com/1032377/67623519-e3e6dd80-f7da-11e9-86ea-9cc9f237123b.png)
new:
![image](https://user-images.githubusercontent.com/1032377/67623527-fc56f800-f7da-11e9-82bd-dc1ff9821b68.png)
Test Plan: Imported from OSS
Differential Revision: D18170995
Pulled By: ezyang
fbshipit-source-id: 461b53641813dc6cfa872a094ae917e750c60759