onnxruntime
1942e40e - [ARM64] MatMulNBits: use neon intrinsics to convert between fp16 and fp32 (#22195)

### Description
For an fp16 A type, the fallback implementation converts the data to fp32 and computes in fp32. This change adds a NEON-intrinsics version of the conversion to speed it up. Store-address alignment and loop unrolling had an insignificant impact on latency, so they are omitted.

| Benchmark | Time | CPU |
|-----------|------|-----|
| M_ConvertF16ToF32/baseline/real_time | 1076961 ns | 1083398 ns |
| M_ConvertF16ToF32/aligned:0/real_time | 46785 ns | 46516 ns |
| M_ConvertF16ToF32/aligned:1/real_time | 46631 ns | 46391 ns |
| M_ConvertF16ToF32_unroll2/aligned:0/real_time | 44074 ns | 44392 ns |
| M_ConvertF16ToF32_unroll2/aligned:1/real_time | 44726 ns | 45226 ns |
| M_ConvertF32ToF16/baseline/real_time | 520109 ns | 527329 ns |
| M_ConvertF32ToF16/aligned:0/real_time | 73610 ns | 74015 ns |
| M_ConvertF32ToF16/aligned:1/real_time | 71557 ns | 71525 ns |
| M_ConvertF32ToF16_unroll2/aligned:0/real_time | 64227 ns | 63374 ns |
| M_ConvertF32ToF16_unroll2/aligned:1/real_time | 67428 ns | 67989 ns |

### Motivation and Context
Speed up the fallback implementation of fp16 MatMulNBits.
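For illustration, here is a minimal sketch of the kind of NEON widening/narrowing conversion the description refers to, assuming an ARM64 toolchain that provides `<arm_neon.h>` and `float16_t`. The function names are hypothetical and this is not the actual MLAS kernel, which lives in `fp16_neon_common.cpp`.

```cpp
// Minimal sketch, not the MLAS implementation: plain NEON loops that widen
// fp16 -> fp32 and narrow fp32 -> fp16, four lanes at a time.
#include <arm_neon.h>
#include <cstddef>

// Hypothetical helper: convert n half-precision values to single precision.
void ConvertF16ToF32(const float16_t* src, float* dst, size_t n) {
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    float16x4_t h = vld1_f16(src + i);    // load 4 fp16 lanes
    vst1q_f32(dst + i, vcvt_f32_f16(h));  // widen to 4 fp32 lanes and store
  }
  for (; i < n; ++i) {                    // scalar tail (0-3 elements)
    dst[i] = static_cast<float>(src[i]);
  }
}

// Hypothetical helper: convert n single-precision values to half precision.
void ConvertF32ToF16(const float* src, float16_t* dst, size_t n) {
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    float32x4_t f = vld1q_f32(src + i);   // load 4 fp32 lanes
    vst1_f16(dst + i, vcvt_f16_f32(f));   // narrow to 4 fp16 lanes and store
  }
  for (; i < n; ++i) {                    // scalar tail (0-3 elements)
    dst[i] = static_cast<float16_t>(src[i]);
  }
}
```

Per the benchmark table above, the vectorized paths run roughly 23x faster than the scalar baseline for fp16→fp32 and roughly 7x faster for fp32→fp16.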
Files changed:
  • cmake/onnxruntime_mlas.cmake
  • onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc
  • onnxruntime/core/mlas/lib/fp16_neon_common.cpp
  • onnxruntime/core/mlas/lib/mlasi.h
  • onnxruntime/core/mlas/lib/platform.cpp
  • onnxruntime/core/mlas/lib/sqnbitgemm_kernel_neon_fp32.cpp
  • onnxruntime/core/mlas/lib/sqnbitgemm_kernel_neon_int8.cpp
  • onnxruntime/test/mlas/bench/bench_fp16_neon_common.cpp
  • onnxruntime/test/mlas/unittest/test_sqnbitgemm_neon_fp16.cpp