[pt][quant] Vectorized qmul and more methods on qint data types (#34376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34376
Vectorized implementation of qmul; qmul is now ~16x faster on my development machine. The implementation works for qint8, quint8, and qint32. Also added some commonly used operations, such as the multiply operator and a requantize operation, to the qint vector classes for future use.
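For context, quantized mul follows the standard dequantize, float-multiply, requantize semantics; the vectorized kernel computes the same result without materializing intermediate float tensors. A minimal sketch of that reference semantics (the tensor shape and the `allclose` tolerance are illustrative assumptions, not taken from the kernel):
```
import torch

scale, zero_point = 0.05, 50
for dtype in [torch.quint8, torch.qint8]:
    qA = torch.quantize_per_tensor(torch.rand(4, 8), scale, zero_point, dtype)
    qB = torch.quantize_per_tensor(torch.rand(4, 8), scale, zero_point, dtype)

    # Reference: dequantize, multiply in float, requantize to the output qparams.
    ref = torch.quantize_per_tensor(qA.dequantize() * qB.dequantize(),
                                    scale, zero_point, dtype)
    out = torch.ops.quantized.mul(qA, qB, scale=scale, zero_point=zero_point)

    # Rounding at quantization boundaries can differ by one step, so compare
    # the dequantized values with a one-step tolerance.
    assert torch.allclose(out.dequantize(), ref.dequantize(), atol=scale)
```
Benchmark script used for the numbers below: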
```
#!/usr/bin/env python
import time

import torch

torch.set_num_threads(1)  # single-threaded, so the numbers reflect per-core kernel speed
# print(torch.__config__.parallel_info())

A = torch.rand(1, 54, 54, 256)
B = torch.rand(1, 54, 54, 256)
scale = .05
zero_point = 50

for dtype in [torch.quint8, torch.qint8]:
    qA = torch.quantize_per_tensor(A, scale=scale, zero_point=zero_point,
                                   dtype=dtype)
    qB = torch.quantize_per_tensor(B, scale=scale, zero_point=zero_point,
                                   dtype=dtype)
    NITER = 1000
    s = time.time()
    for i in range(NITER):
        out = torch.ops.quantized.mul(qA, qB, scale=scale, zero_point=zero_point)
    time_per_iter = (time.time() - s) / NITER
    print('dtype: {} time per iter ms: {:.3f}'.format(dtype, time_per_iter * 1000))
```
### Before
```
dtype: torch.quint8 time per iter ms: 6.714
dtype: torch.qint8 time per iter ms: 6.780
```
### After
```
dtype: torch.quint8 time per iter ms: 0.431
dtype: torch.qint8 time per iter ms: 0.417
```
### Test
Modified qmul tests to include qint8 and qint32 data types.
```
python test/test_quantized.py TestQuantizedOps.test_qmul_relu_same_qparams
python test/test_quantized.py TestQuantizedOps.test_qmul_relu_different_qparams
python test/test_quantized.py TestQuantizedOps.test_qmul_broadcast
```
ghstack-source-id: 99862681
Differential Revision: D20308515
fbshipit-source-id: 4fa65b2ba433cfd59260fc183a70f53a6fcc36b4