Copy quantize routine to vec256 (#25685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25685
This saves a bunch of dynamic linking/function call overhead on the quantize/dequantize hot path.

Benchmark script:
```
import torch
import time

x = torch.rand(1, 256, 56, 56)
y = torch.rand(1, 256, 56, 56)

print('dtype', 'ms/iter (float)', 'ms/iter (quant)', 'quant / float', sep='\t')
for dtype in [torch.quint8, torch.qint8, torch.qint32]:
    qX = torch.quantize_linear(x, 0.1, 5, dtype).permute([0, 3, 1, 2])
    qY = torch.quantize_linear(y, 0.1, 5, dtype).permute([0, 3, 1, 2])
    _x = x.permute([0, 3, 1, 2])
    _y = y.permute([0, 3, 1, 2])

    NITER = 1000

    # Test float
    s = time.time()
    for i in range(NITER):
        _x + _y
    elapsed_float = time.time() - s
    ms_per_iter_float = elapsed_float / NITER * 1000

    # Test quantized
    s = time.time()
    for i in range(NITER):
        torch.ops.quantized.add(qX, qY, 0.1, 5)
    elapsed = time.time() - s
    ms_per_iter = elapsed / NITER * 1000

    print(str(dtype), ms_per_iter_float, ms_per_iter, ms_per_iter / ms_per_iter_float, sep='\t')
```
Before this change (DynDisp to AVX2)
```
dtype ms/iter (float) ms/iter (quant) quant / float
torch.quint8 0.47539472579956055 0.5174136161804199 1.0883873717996941
torch.qint8 0.46573758125305176 0.5322310924530029 1.1427703365080666
torch.qint32 0.47144651412963867 4.043398380279541 8.576579228174513
```
After this change (DynDisp to AVX2)
```
dtype ms/iter (float) ms/iter (quant) quant / float
torch.quint8 0.48140883445739746 0.3396260738372803 0.705483675263412
torch.qint8 0.4651052951812744 0.3467671871185303 0.7455670591395397
torch.qint32 0.4986207485198975 4.015796899795532 8.053810259031533
```
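As a quick sanity check on the tables above, the `quant / float` column is just the quantized per-iteration time divided by the float per-iteration time; e.g. for `torch.quint8` after this change (values copied from the table):

```python
# quint8 row from the "After this change" table above
ms_per_iter_float = 0.48140883445739746  # float add, ms/iter
ms_per_iter_quant = 0.3396260738372803   # quantized add, ms/iter

ratio = ms_per_iter_quant / ms_per_iter_float
print(ratio)  # matches the table's 0.705483675263412
```

So the quint8/qint8 quantized add goes from ~10-15% slower than float to ~25-30% faster, while qint32 (which has no fast vectorized path here) is essentially unchanged.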
Test Plan: Imported from OSS
Differential Revision: D17199438
Pulled By: jamesr66a
fbshipit-source-id: d518500c2b5f4e3a202d9ebc2a5862b4062ef118