[quant] Make FakeQuant use REGISTER_DISPATCH (#33682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33682
Previously, there were separate APIs for CPU and CUDA. This change keeps a single top-level API for each op, i.e. `fake_quantize_per_tensor_affine` and `fake_quantize_per_channel_affine`, and uses the input tensor's device type to dispatch to the appropriate backend (CPU or CUDA) via REGISTER_DISPATCH.
The CPU kernel implementation is in QuantizedOpKernels.cpp.
The CUDA kernel implementation is in fake_quantize_core.cu.
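A minimal sketch of the dispatch behavior from the Python side: the same top-level op is called regardless of device, and the backend kernel is selected from the input tensor's device. The scale/zero-point values below are arbitrary illustration choices, not from the PR.

```python
import torch

x = torch.randn(2, 3)

# One top-level API; the CPU kernel (QuantizedOpKernels.cpp) handles this call.
y_cpu = torch.fake_quantize_per_tensor_affine(
    x, scale=0.1, zero_point=0, quant_min=0, quant_max=255
)

# The identical call on a CUDA tensor dispatches to the CUDA kernel
# (fake_quantize_core.cu) with no separate API.
if torch.cuda.is_available():
    y_cuda = torch.fake_quantize_per_tensor_affine(
        x.cuda(), scale=0.1, zero_point=0, quant_min=0, quant_max=255
    )
```

Fake quantization simulates the quantize/dequantize round trip in float, so the output has the same dtype and shape as the input.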
Test Plan:
python test/test_fake_quant.py
Benchmark results on CPU, fake-quantizing a tensor of size (2, 256, 128, 128):

Before:
  per tensor quant:  9.91 ms
  per channel quant: 74.94 ms
After:
  per tensor quant:  6.03 ms
  per channel quant: 44.92 ms
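A hedged sketch of how such a CPU timing could be reproduced (the exact benchmark harness used for the numbers above is not part of this PR; iteration count and warm-up are assumptions):

```python
import time
import torch

x = torch.randn(2, 256, 128, 128)

def bench_ms(fn, iters=10):
    """Return average wall-clock milliseconds per call after one warm-up."""
    fn()  # warm-up
    t0 = time.time()
    for _ in range(iters):
        fn()
    return (time.time() - t0) / iters * 1000

# Arbitrary quantization parameters for illustration.
per_tensor_ms = bench_ms(
    lambda: torch.fake_quantize_per_tensor_affine(x, 0.1, 0, 0, 255)
)
print(f"per tensor quant ms {per_tensor_ms}")
```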
Imported from OSS
Differential Revision: D20072656
fbshipit-source-id: 0424f763775f88b93380a452e3d6dd0c90cb814b