Remove one unnecessary copy of the output during the type promotion. (#26816)
Summary:
Output tensors doesn't need to be copied during type promotion as we are not using any data from them. Simple allocation gives steady 10% performance gain.
BEFORE
```
In [1]: x = torch.randn(64, 2048, 7,7)
In [2]: y = torch.randn(64, 2048, 7,7, dtype=torch.float64)
In [3]: timeit x.add_(y)
77.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
AFTER
```
In [1]: x = torch.randn(64, 2048, 7,7)
In [2]: y = torch.randn(64, 2048, 7,7, dtype=torch.float64)
In [3]: timeit x.add_(y)
68.2 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26816
Differential Revision: D17573455
Pulled By: VitalyFedyunin
fbshipit-source-id: 47286abce5e7e665eb61e46ae358c896e945bef2