speed up quantized relu6 inplace kernel (#68404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68404
The qclamp kernel is equal in speed to the qrelu6 kernel in the non-inplace case and faster in the inplace case. This PR removes the qrelu6 kernel and routes qrelu6 to the qclamp kernel instead.
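Since qrelu6 now routes to qclamp, the two ops should produce identical results. A minimal sketch of that equivalence check, assuming `torch.clamp` on a quantized tensor dispatches to the quantized clamp kernel (the `quantized::relu6` signature matches its use in the benchmark below):
```
import torch

# sketch: check that quantized relu6 matches clamp(0, 6) on the same input
qx = torch.quantize_per_tensor(torch.randn(8, 8), 0.05, 0, torch.quint8)
out_relu6 = torch.ops.quantized.relu6(qx, inplace=False)
out_clamp = torch.clamp(qx, 0.0, 6.0)  # dispatches to the quantized clamp kernel
assert torch.equal(out_relu6.dequantize(), out_clamp.dequantize())
```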
Test Plan:
```
# correctness
python test/test_quantization.py TestQuantizedOps.test_qrelu6

# benchmarking
import torch
import torch.nn.functional as F
import time

toq = torch.ops.quantized

N_WARMUP = 5
N_ITER = 1000

data = torch.randn(32, 32, 64, 64)
data = torch.quantize_per_tensor(data, 0.05, 0, torch.quint8)

# quantized hardtanh(0, 6), which uses the qclamp kernel
for _ in range(N_WARMUP):
    F.hardtanh(data, 0., 6., inplace=True)
t1 = time.time()
for _ in range(N_ITER):
    F.hardtanh(data, 0., 6., inplace=True)
t2 = time.time()

# quantized relu6
for _ in range(N_WARMUP):
    toq.relu6(data, inplace=True)
t3 = time.time()
for _ in range(N_ITER):
    toq.relu6(data, inplace=True)
t4 = time.time()

t_hardtanh = t2 - t1
t_qrelu6 = t4 - t3
print(t_hardtanh, t_qrelu6)

# before
0.7156341075897217 1.4007949829101562

# after
0.6825599670410156 0.6571671962738037
```
Reviewed By: jerryzh168
Differential Revision: D32463754
Pulled By: vkuzo
fbshipit-source-id: a87fe5907d7b71d87eb1d5f6588cd509a88f2969