avx2 using intrinsic, enable binary broadcasting parallel (#4216)

Commit

5 years ago

qlinaradd for arm/sse2/avx2 using intrinsic, enable binary broadcasting parallel (#4216)

* Support quantized linear binary element-wise math ops; implement QLinearAdd. Add tests for quantized linear binary element-wise math ops, including a test for QLinearAdd. Add QLinearAdd with an SSE2 intrinsic implementation, an AVX2 assembly implementation, and NEON intrinsic support. QLinearAdd supports VectorOnVector, VectorOnScalar, and ScalarOnVector. Generalize QlinearBinaryOp parallelization with respect to broadcasting.
* Modify according to PR feedback. Mainly:
  * Template helper to generalize the qladd logic over v2v, s2v, and v2s.
  * Remove GetKernel-related code.
  * Change the mixed legacy MM/SSE code in the AVX code.
  * Formatting, typos, conventions, etc.
* Utilize MlasSubtractInt32x4 in MlasDequantizeLinearVector().
* Some format fixes.
* More natural parallel parameter type.
* Fix build break for x86.
* Comments go to 80 columns before wrapping.
* Many macro-related changes to the assembly. Use vminps rather than vpminsd to handle NaN. Tested on Windows.
* Use clang-format to format the file.
* Fix arm32 build error.
* Remove some duplication across different #if defined blocks.
* Working add.u8 vector-to-vector.
* Fix runtime bus error on real arm32 Linux.
* Fix typo in storing the last lane.
* arm32 qlinearadd handles scalar.
* Move qladd to a separate C++ file.
* Add neon64 qladd.
* Refactor, and improve two operations using arm64-only instructions.
* Fix typo for arm64.
* Use strict ops in pure C++ (min/max on float values).
* New SSE2 version.
* Merge arm/sse2/avx2.
* Pass arm/sse/avx2 Linux tests.
* Remove unused assembly file.
* Remove unused data definitions and trailing spaces.
* Fix broadcasting parallel issue.
* Enhance broadcasting scenarios. Allow test result differences due to rounding on half.
* Add Mlas or MLAS_ prefix for namespace safety.
* Handle alignment issue on arm32 for GCC/MSVC. Remove some unused signed/unsigned int ops.
* Specify /arch:AVX2 for qladd_avx2.cpp.
* Fix type during copy/paste when unrolling. Improve one GreatEqual condition. Improve formatting by splitting two statements that shared a single line.
* The Arm NEON alignment parameter is in bits rather than bytes; change it.
* Move qladd_avx2.cpp to the intrinsics/avx2/ folder.
* Format using mlas style.
* Double-check mlas style for these files.
* Change indent from 2 to 4 for qladd_avx2.cpp.
* Fix Windows x86 build error due to SSE2 lacking _mm_cvtsi128_si64.
* Re-trigger all pipelines, as the old failed pipeline was updated.

Co-authored-by: Lei Zhang <phill.zhang@gmail.com>
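To make the arithmetic behind the commit message concrete, the following is a minimal scalar reference sketch of a quantized linear add on uint8 data, covering the three operand shapes listed above (vector on vector, vector on scalar, scalar on vector). It is not the MLAS code from this commit: the names `QuantParams`, `QLinearAddOneRef`, and `QLinearAddRef` are hypothetical, and the rounding mode (the default round-to-nearest-even of `std::nearbyint`) is an assumption. The commit itself notes that results may differ on exact halves.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not part of MLAS): per-tensor quantization
// parameters for uint8 data.
struct QuantParams {
    float scale;
    uint8_t zero_point;
};

// Reference for one element: dequantize both inputs, add in float,
// requantize with the output parameters, and saturate to [0, 255].
inline uint8_t QLinearAddOneRef(uint8_t a, QuantParams pa,
                                uint8_t b, QuantParams pb,
                                QuantParams pc)
{
    assert(pc.scale > 0.0f);
    const float va = pa.scale * static_cast<float>(static_cast<int32_t>(a) - pa.zero_point);
    const float vb = pb.scale * static_cast<float>(static_cast<int32_t>(b) - pb.zero_point);
    // Assumption: round to nearest (ties to even under the default FP mode).
    float q = std::nearbyint((va + vb) / pc.scale) + static_cast<float>(pc.zero_point);
    q = std::min(255.0f, std::max(0.0f, q));
    return static_cast<uint8_t>(q);
}

// Reference over whole buffers. An input of length 1 is treated as a scalar
// and broadcast against the other input; otherwise lengths must match.
inline std::vector<uint8_t> QLinearAddRef(const std::vector<uint8_t>& a, QuantParams pa,
                                          const std::vector<uint8_t>& b, QuantParams pb,
                                          QuantParams pc)
{
    assert(!a.empty() && !b.empty());
    assert(a.size() == b.size() || a.size() == 1 || b.size() == 1);
    const std::size_t n = std::max(a.size(), b.size());
    std::vector<uint8_t> c(n);
    for (std::size_t i = 0; i < n; ++i) {
        const uint8_t ai = (a.size() == 1) ? a[0] : a[i];
        const uint8_t bi = (b.size() == 1) ? b[0] : b[i];
        c[i] = QLinearAddOneRef(ai, pa, bi, pb, pc);
    }
    return c;
}
```

A vectorized kernel would process several lanes per iteration and handle the scalar operand once outside the loop; a reference like this is mainly useful as the oracle in tests, with a tolerance of one unit for inputs that land exactly on a rounding half.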

References

#4216 - qlinaradd for arm/sse2/avx2 using intrinsic, enable binary broadcasting parallel

Author

zhanghuanrong

Parents

49268c42

onnxruntime 94c98aa0 - qlinaradd for arm/sse2/avx2 using intrinsic, enable binary broadcasting parallel (#4216)
