Use more efficient specialized Quantize routine (#25731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25731
I didn't notice this before, but the QuantizeAvx2 routine was quantizing only a single vector of 8 floats into 1/4 of a 256-bit int8 register. This switches it to use a specialization, borrowed from Caffe2, that quantizes 4 float vectors into a whole int8 vector.
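
For context, here is a minimal sketch of what such a 4-vectors-at-a-time kernel looks like with AVX2 intrinsics, assuming affine quantization with an inverse scale and zero point. The function name and signature are hypothetical illustrations, not the actual FBGEMM/Caffe2 code:

```cpp
#include <immintrin.h>
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: quantize `len` floats to uint8 using
// q = clamp(round(x * inverse_scale) + zero_point, 0, 255),
// processing 32 elements (4 x 8-float vectors) per iteration so each
// store fills an entire 256-bit register.
void QuantizeAvx2Sketch(const float* src, uint8_t* dst, int len,
                        float inverse_scale, int32_t zero_point) {
  constexpr int kVLen = 8;  // floats per 256-bit vector
  const __m256 inv_scale_v = _mm256_set1_ps(inverse_scale);
  const __m256i zp_v = _mm256_set1_epi32(zero_point);
  // packs/packus interleave across 128-bit lanes; this permute restores
  // element order afterwards.
  const __m256i permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0);
  int i = 0;
  for (; i + 4 * kVLen <= len; i += 4 * kVLen) {
    // Load 4 vectors of 8 floats = 32 elements.
    __m256 x = _mm256_loadu_ps(src + i);
    __m256 y = _mm256_loadu_ps(src + i + kVLen);
    __m256 z = _mm256_loadu_ps(src + i + 2 * kVLen);
    __m256 w = _mm256_loadu_ps(src + i + 3 * kVLen);
    // Scale, round to int32 (nearest-even under default MXCSR), add zero point.
    __m256i x_i = _mm256_add_epi32(
        _mm256_cvtps_epi32(_mm256_mul_ps(x, inv_scale_v)), zp_v);
    __m256i y_i = _mm256_add_epi32(
        _mm256_cvtps_epi32(_mm256_mul_ps(y, inv_scale_v)), zp_v);
    __m256i z_i = _mm256_add_epi32(
        _mm256_cvtps_epi32(_mm256_mul_ps(z, inv_scale_v)), zp_v);
    __m256i w_i = _mm256_add_epi32(
        _mm256_cvtps_epi32(_mm256_mul_ps(w, inv_scale_v)), zp_v);
    // Saturating packs: 4x8 int32 -> 2x16 int16 -> 1x32 uint8,
    // which also clamps to [0, 255].
    __m256i xy = _mm256_packs_epi32(x_i, y_i);
    __m256i zw = _mm256_packs_epi32(z_i, w_i);
    __m256i xyzw = _mm256_packus_epi16(xy, zw);
    xyzw = _mm256_permutevar8x32_epi32(xyzw, permute_mask);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), xyzw);
  }
  // Scalar tail for the remaining elements.
  for (; i < len; ++i) {
    int32_t q = static_cast<int32_t>(std::nearbyint(src[i] * inverse_scale)) +
        zero_point;
    dst[i] = static_cast<uint8_t>(std::min(255, std::max(0, q)));
  }
}
```

The payoff is that the int32-to-int8 narrowing (the two pack steps and the store) is amortized over 32 outputs per loop iteration instead of 8, rather than writing only the low quarter of a 256-bit register each time.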
Test Plan: Imported from OSS
Differential Revision: D17214413
Pulled By: jamesr66a
fbshipit-source-id: 1d6fc556e43739e9a4b0dba5df2332beb1b3795b