onnxruntime
Mlas int4 int8 with avx2/512
#20687
Merged

Commits
  • quick adapt llama.cpp to experiment performance. Only works with blklen32, symmetric1 hasBias0 Int8
    liqunfu committed 1 year ago
  • fire
    liqunfu committed 1 year ago
  • tile 2x4 SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 1542487160 ns 1539062500 ns
    liqunfu committed 1 year ago
  • use one_16_epi16 and accumulate_2blk_dot: SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 1434872720 ns
    liqunfu committed 1 year ago
  • apply to M1, BQuant layout pack block (subblk) larger than blklen: SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 1265060620 ns 1265625000 ns
    liqunfu committed 1 year ago
  • use new AQuant layout (not work if total M is not RangeCountM): SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 1214042220 ns
    liqunfu committed 1 year ago
  • apply blksum to blklen32 and 64: SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 784668090 ns; SQNBITGEMM<4>/BlkLen:64/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 754939430 ns
    liqunfu committed 1 year ago
  • blklen16
    liqunfu committed 1 year ago
  • impl avx512: SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 664029830 ns
    liqunfu committed 1 year ago
  • matmul_nbit & fix alignment for sgemm
    liqunfu committed 1 year ago
  • merge main
    liqunfu committed 1 year ago
  • fix mlas benchmark not using multi threads
    liqunfu committed 1 year ago
  • profiling
    liqunfu committed 1 year ago
  • Merge branch 'liqun/mlas-q4-tile-avx' of https://github.com/microsoft/onnxruntime into liqun/mlas-q4-tile-avx
    liqunfu committed 1 year ago
  • sgemm after sq4bit for avx2
    liqunfu committed 1 year ago
  • avx512
    liqunfu committed 1 year ago
  • layout to follow compute, M1 separate with M > 1
    liqunfu committed 1 year ago
  • make avx512 run
    liqunfu committed 1 year ago
  • Merge branch 'main' into liqun/mlas-q4-tile-avx
    liqunfu committed 1 year ago
  • avx512 blklen64 pass
    liqunfu committed 1 year ago
  • pass avx512 blklen32
    liqunfu committed 1 year ago
  • pass avx512 blklen 16, 128, 256
    liqunfu committed 1 year ago
  • pass fp32, refactor sqnbitgemm
    liqunfu committed 1 year ago
  • merge main
    liqunfu committed 1 year ago
  • avx512vnni
    liqunfu committed 1 year ago
  • merge main
    liqunfu committed 1 year ago
  • avxvnni
    liqunfu committed 1 year ago
  • rm unused ComputeParallelTasksSGemm
    liqunfu committed 1 year ago
  • avoid _mm256_dpbusds_avx_epi32 in avx512vnni
    liqunfu committed 1 year ago
  • fix linux build
    liqunfu committed 1 year ago
  • Merge branch 'main' into liqun/mlas-q4-tile-avx
    liqunfu committed 1 year ago
  • refactor for Arm64
    liqunfu committed 1 year ago
  • more refactor for Arm64
    liqunfu committed 1 year ago
  • hsum_float_16
    liqunfu committed 1 year ago
  • hsum_float_16
    liqunfu committed 1 year ago
  • condition for -mavxvnni
    liqunfu committed 1 year ago
  • CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 10
    liqunfu committed 1 year ago
  • missed 2 files from (__GNUC__ > 10)
    liqunfu committed 1 year ago
  • missed _mm256_dpbusds_avx_epi32 and print out cmake msgs
    liqunfu committed 1 year ago
  • unused zp, etc.
    liqunfu committed 1 year ago
  • unused zp, etc.
    liqunfu committed 1 year ago
  • remove test code changes
    liqunfu committed 1 year ago
  • remove test code changes
    liqunfu committed 1 year ago
  • lint
    liqunfu committed 1 year ago
  • lint
    liqunfu committed 1 year ago
  • code name
    liqunfu committed 1 year ago
  • update reviewers' comments
    liqunfu committed 1 year ago
  • Merge branch 'main' into liqun/mlas-q4-tile-avx
    liqunfu committed 1 year ago