[caffe2] optimize 2/4-bit row-wise quantization (#387)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/387
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39985
avx2 optimized 2/4-bit row-wise quantization/dequantization in perfkernels.
This diff slightly change the numerics of quantization by multiplying with the inverse of scale instead of dividing with scale.
Test Plan:
In my devserver
for i in 2 4 8; do echo $i; buck run mode/opt :fused_rowwise_nbit_conversion_bench -- --bit-rate=$i; done
Before this diff
2-bit
3.35394 ms. 100%. FloatToFused2BitRowwiseQuantized
4-bit
3.60351 ms. 100%. FloatToFused4BitRowwiseQuantized
8-bit
0.434467 ms. 100%. FloatToFused8BitRowwiseQuantized
After this diff
2-bit
0.606386 ms. 100%. FloatToFused2BitRowwiseQuantized
4-bit
0.446683 ms. 100%. FloatToFused4BitRowwiseQuantized
8-bit
0.4349 ms. 100%. FloatToFused8BitRowwiseQuantized
Reviewed By: choudharydhruv, jianyuh
Differential Revision: D22033195
fbshipit-source-id: d3a219e47b8345268d90a160c9314ed0d5b71467