Use AVX2 for Add without broadcast when inputs are uint8_t (#25098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25098
Use the same optimization we used for the Sum operator in Add when broadcast is not used and inputs are uint8_t.
The optimization uses AVX2 instructions and fp32 arithmetic (instead of pure fixed-point arithmetic). This introduces numerical differences, but only in minor cases such as tie-breaking when rounding.
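For readers unfamiliar with the fp32 approach, the following is a minimal scalar sketch of what an fp32-based quantized add could look like, not the code in this change. The names `QParams`, `RequantizeFp32`, and `QuantizedAddFp32` are hypothetical, and the affine scheme real = scale * (q - zero_point) is an assumption. An AVX2 version would presumably apply the same steps to 8 fp32 lanes at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical quantization parameters (assumed affine scheme):
//   real_value = scale * (quantized_value - zero_point)
struct QParams {
  float scale;
  std::int32_t zero_point;
};

// Requantize an fp32 value to uint8_t.
// std::nearbyint follows the current rounding mode, which is
// round-to-nearest-even by default. A fixed-point path may break
// ties differently, which is one place small mismatches can appear.
inline std::uint8_t RequantizeFp32(float real, const QParams& out) {
  float q = std::nearbyint(real / out.scale) +
            static_cast<float>(out.zero_point);
  q = std::min(255.0f, std::max(0.0f, q));  // saturate to the uint8_t range
  return static_cast<std::uint8_t>(q);
}

// Elementwise add of two uint8_t tensors of the same shape (no broadcast):
// dequantize both inputs to fp32, add, then requantize the sum.
std::vector<std::uint8_t> QuantizedAddFp32(
    const std::vector<std::uint8_t>& a, const QParams& qa,
    const std::vector<std::uint8_t>& b, const QParams& qb,
    const QParams& qout) {
  assert(a.size() == b.size());
  std::vector<std::uint8_t> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float ra = qa.scale * static_cast<float>(a[i] - qa.zero_point);
    const float rb = qb.scale * static_cast<float>(b[i] - qb.zero_point);
    out[i] = RequantizeFp32(ra + rb, qout);
  }
  return out;
}
```

With an input scale of 0.5 and an output scale of 1.0, a sum that lands exactly on a half (for example 0.5 or 1.5) is rounded to the nearest even integer in this sketch, which illustrates the kind of tie-breaking case where fp32 and fixed-point results can differ by one.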
Test Plan: buck test caffe2/caffe2/quantization/server:elementwise_add_dnnlowp_op_test
Reviewed By: jianyuh
Differential Revision: D16985776
fbshipit-source-id: 8097503dd55f7d39857b3e4102db0f91327a6f55