Ensure quantized::add stride matches inputs (#25265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25265
This ensures that the output strides match the input strides. Previously, the call to _empty_affine_quantized would produce an output tensor with different strides than the operands; when TensorIterator sees such a mismatch, it falls back to slow scalar code instead of the vectorized kernel. This change fixes that, keeping the fast path for strided (e.g. permuted) inputs.
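As an illustration of the stride mismatch (not the PR's code), here is how a permuted tensor's strides differ from a contiguous one, and how the float path already preserves the inputs' layout in the output. Plain float tensors are used here to keep the example independent of the quantization API (in later PyTorch releases `quantize_linear` was renamed `quantize_per_tensor`):

```python
import torch

x = torch.rand(1, 56, 56, 256)    # NHWC-shaped, contiguous
xp = x.permute([0, 3, 1, 2])      # NCHW view: same storage, new strides

print(x.stride())            # (802816, 14336, 256, 1)
print(xp.stride())           # (802816, 1, 14336, 256)
print(xp.is_contiguous())    # False

# For float ops, TensorIterator keeps the fast path because the output
# of the binary op inherits the inputs' strides:
out = xp + xp
print(out.stride() == xp.stride())  # True
```

Before this PR, `quantized::add` instead allocated a contiguous output for such inputs, so TensorIterator degraded to the scalar loop.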
Benchmark script:
```
import torch, time
x = torch.rand(1, 56, 56, 256)
y = torch.rand(1, 56, 56, 256)
qX = torch.quantize_linear(x, 0.1, 128, torch.quint8)
qY = torch.quantize_linear(y, 0.1, 128, torch.quint8)
s = time.time()
for i in range(1000):
    x + y
print('float contig', time.time() - s)
s = time.time()
for i in range(1000):
    torch.ops.quantized.add(qX, qY, 0.5, 1)
print('quantized contig', time.time() - s)
x = torch.rand(1, 56, 56, 256)
y = torch.rand(1, 56, 56, 256)
qX = torch.quantize_linear(x, 0.1, 128, torch.quint8).permute([0, 3, 1, 2])
qY = torch.quantize_linear(y, 0.1, 128, torch.quint8).permute([0, 3, 1, 2])
x = x.permute([0, 3, 1, 2])
y = y.permute([0, 3, 1, 2])
s = time.time()
for i in range(1000):
    x + y
print('float strided', time.time() - s)
s = time.time()
for i in range(1000):
    torch.ops.quantized.add(qX, qY, 0.5, 1)
print('quantized strided', time.time() - s)
```
Before this change:
```
$ OMP_NUM_THREADS=1 python cmp.py
float contig 0.4625673294067383
quantized contig 1.8083674907684326
float strided 0.46366071701049805
quantized strided *8.30056643486023*
```
After this change:
```
$ OMP_NUM_THREADS=1 python cmp.py
float contig 0.48703694343566895
quantized contig 2.0587124824523926
float strided 0.4711723327636719
quantized strided *2.0382332801818848*
```
Test Plan: Imported from OSS
Differential Revision: D17077811
Pulled By: jamesr66a
fbshipit-source-id: 25f52743081162122dfc9eb4bc39185d4cc4ba3b