Enabling intra-op parallelism for dynamic quantized Linear operator (#28477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28477
Similar to https://github.com/pytorch/pytorch/pull/26692, we would like to enable intra-op parallelism for the dynamic quantized Linear op.
ghstack-source-id: 92419573
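For context, the `quantized::linear_dynamic` op benchmarked below is what an `nn.Linear` module dispatches to after dynamic quantization. A minimal sketch of reaching it through the public `torch.quantization.quantize_dynamic` API (the model and shapes here are illustrative, not from this diff):

```python
import torch

# A toy fp32 model containing a Linear layer to be dynamically quantized.
fp32_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))

# Replace nn.Linear with its dynamically quantized counterpart; weights are
# quantized to int8 ahead of time, activations are quantized on the fly.
dq_model = torch.quantization.quantize_dynamic(
    fp32_model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.rand(4, 1024)
y = dq_model(x)  # internally calls quantized::linear_dynamic
print(y.shape)
```

The intra-op parallelism enabled by this diff applies inside that single op call, so `torch.set_num_threads` controls how many threads the GEMM uses.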
Test Plan:
CI
Test Benchmark:
```
import time
import torch

# Measure dynamic quantized Linear throughput (GFLOP/s) across thread counts.
K, N = 1024, 1024
print('M', 'nthread=1', 'nthread=2', 'nthread=4', 'nthread=8', 'nthread=16', sep=', ')
for M in range(512, 2049, 512):
    print(M, sep=',', end=', ')
    for num_threads in (1, 2, 4, 8, 16,):
        torch.set_num_threads(num_threads)
        x = torch.rand(M, K)
        w = torch.rand(K, N)
        NITER = 20
        # Test dynamic quantized
        q_w = torch.quantize_per_tensor(w, 0.01, 0, dtype=torch.qint8)
        packed_w = torch.ops.quantized.linear_prepack(q_w, None)
        s = time.time()
        for i in range(NITER):
            torch.ops.quantized.linear_dynamic(x, packed_w)
        elapsed_per_iter_dyn_quant = (time.time() - s) / NITER
        print("{:0.2f}".format(2.0 * M * N * K / elapsed_per_iter_dyn_quant / 1E9), end=', ')
    print("\n", end='')
```
Before this Diff (GFLOP/s):
```
(base) [root@rtptest10054.frc2 ~/jhuang_test/dynamic_quant]# python benchmark_quantize_dynamic.py
M, nthread=1, nthread=2, nthread=4, nthread=8, nthread=16
512, 119.28, 139.50, 141.66, 141.58, 141.42,
1024, 122.42, 141.21, 123.09, 141.85, 123.03,
1536, 122.80, 122.18, 141.39, 123.25, 141.35,
2048, 123.41, 141.34, 123.62, 140.55, 123.76,
```
After this Diff (GFLOP/s):
```
(base) [root@rtptest10054.frc2 ~/jhuang_test/dynamic_quant]# python benchmark_quantize_dynamic.py
M, nthread=1, nthread=2, nthread=4, nthread=8, nthread=16
512, 123.29, 271.99, 508.66, 882.83, 1295.07,
1024, 126.05, 273.15, 515.42, 914.11, 877.63,
1536, 142.48, 236.85, 524.10, 481.32, 970.81,
2048, 124.76, 279.03, 433.73, 958.67, 1045.82,
```
Differential Revision: D18074757
fbshipit-source-id: ad5b43477d2187c818c137093c6d6af02d5ca1d5