pytorch
938bab0b - [PyTorch] Add int version of vectorized PrefixSum to Benchmark (#67865)

Commit
3 years ago
[PyTorch] Add int version of vectorized PrefixSum to Benchmark (#67865)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67865

- Add int version of vectorized PrefixSum
- Use unaligned load/store instructions
- Add exclusive scan version. "exclusive" means that the i-th input element is not included in the i-th sum. For details see https://en.cppreference.com/w/cpp/algorithm/exclusive_scan

Test Plan:
```
buck build mode/opt-clang //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 numactl -m 0 -C 5 \
./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench
```

For full benchmark results, see P465274613
```
PrefixSumBench/LocalInt/64                           57 ns       56 ns   12414048 GB/s=9.06239G/s
PrefixSumBench/LocalInt/256                         221 ns      221 ns    3160853 GB/s=9.28635G/s
PrefixSumBench/LocalInt/1024                        818 ns      817 ns     857922 GB/s=10.0235G/s
PrefixSumBench/LocalInt/4096                       3211 ns     3210 ns     217614 GB/s=10.2093G/s
PrefixSumBench/LocalInt/16384                     12806 ns    12804 ns      54805 GB/s=10.2364G/s
PrefixSumBench/LocalInt/65536                     51115 ns    51079 ns      13741 GB/s=10.2643G/s
PrefixSumBench/LocalInt/262144                   205974 ns   205912 ns       3401 GB/s=10.1847G/s
PrefixSumBench/LocalInt/1048576                  829523 ns   828859 ns        845 GB/s=10.1207G/s
PrefixSumBench/LocalIntAVX2/64                       45 ns       45 ns   15568113 GB/s=11.3549G/s
PrefixSumBench/LocalIntAVX2/256                     208 ns      208 ns    3371174 GB/s=9.86913G/s
PrefixSumBench/LocalIntAVX2/1024                    893 ns      892 ns     783154 GB/s=9.18629G/s
PrefixSumBench/LocalIntAVX2/4096                   3618 ns     3613 ns     193834 GB/s=9.06838G/s
PrefixSumBench/LocalIntAVX2/16384                 14416 ns    14411 ns      48564 GB/s=9.09543G/s
PrefixSumBench/LocalIntAVX2/65536                 57650 ns    57617 ns      12156 GB/s=9.09952G/s
PrefixSumBench/LocalIntAVX2/262144               230855 ns   230612 ns       3035 GB/s=9.09386G/s
PrefixSumBench/LocalIntAVX2/1048576              924265 ns   923777 ns        758 GB/s=9.08077G/s
PrefixSumBench/LocalIntAVX512/64                     23 ns       23 ns   24876551 GB/s=22.0697G/s
PrefixSumBench/LocalIntAVX512/256                    95 ns       95 ns    7387386 GB/s=21.556G/s
PrefixSumBench/LocalIntAVX512/1024                  435 ns      435 ns    1609682 GB/s=18.8425G/s
PrefixSumBench/LocalIntAVX512/4096                 1815 ns     1815 ns     385462 GB/s=18.0561G/s
PrefixSumBench/LocalIntAVX512/16384                7479 ns     7476 ns      93660 GB/s=17.5335G/s
PrefixSumBench/LocalIntAVX512/65536               30171 ns    29879 ns      23430 GB/s=17.5468G/s
PrefixSumBench/LocalIntAVX512/262144             125805 ns   125631 ns       5570 GB/s=16.6929G/s
PrefixSumBench/LocalIntAVX512/1048576            504216 ns   503983 ns       1384 GB/s=16.6446G/s
PrefixSumBench/ExclusiveScanIntAVX512/64             23 ns       23 ns   30058295
PrefixSumBench/ExclusiveScanIntAVX512/256           101 ns      101 ns    7398498
PrefixSumBench/ExclusiveScanIntAVX512/1024          435 ns      434 ns    1403877
PrefixSumBench/ExclusiveScanIntAVX512/4096         1979 ns     1978 ns     354016
PrefixSumBench/ExclusiveScanIntAVX512/16384        7828 ns     7819 ns      89551
PrefixSumBench/ExclusiveScanIntAVX512/65536       31206 ns    31192 ns      22408
PrefixSumBench/ExclusiveScanIntAVX512/262144     130106 ns   130023 ns       5388
PrefixSumBench/ExclusiveScanIntAVX512/1048576    525515 ns   524976 ns       1244
```

Reviewed By: navahgar, swolchok

Differential Revision: D32011740

fbshipit-source-id: 7962de710bd588291dd6bf0c719f579c55f7c063
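
The difference between the inclusive prefix sum and the exclusive scan mentioned in the summary can be illustrated with a scalar sketch. This is not the benchmark's code: the function names `inclusive_prefix_sum`, `exclusive_prefix_sum`, and `check_scan_relation` are hypothetical, and the loops are deliberately unvectorized. They only show the semantics that the AVX2/AVX512 variants are expected to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + ... + in[i].
// (Hypothetical helper for illustration; not part of the benchmark.)
std::vector<int> inclusive_prefix_sum(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  int running = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    running += in[i];   // add the i-th element first ...
    out[i] = running;   // ... so it is part of the i-th sum
  }
  return out;
}

// Exclusive scan: out[i] = init + in[0] + ... + in[i-1].
// The i-th input element is NOT included in the i-th sum.
std::vector<int> exclusive_prefix_sum(const std::vector<int>& in, int init = 0) {
  std::vector<int> out(in.size());
  int running = init;
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = running;   // store the sum of everything before i ...
    running += in[i];   // ... then fold in the i-th element
  }
  return out;
}

// Sanity check of the relation between the two scans (with init = 0):
// exclusive[i] + in[i] == inclusive[i] for every i.
void check_scan_relation(const std::vector<int>& in) {
  const std::vector<int> inc = inclusive_prefix_sum(in);
  const std::vector<int> exc = exclusive_prefix_sum(in);
  for (std::size_t i = 0; i < in.size(); ++i) {
    assert(exc[i] + in[i] == inc[i]);
  }
}
```

For the input `{3, 1, 4, 1, 5}`, the inclusive version yields `{3, 4, 8, 9, 14}` and the exclusive version yields `{0, 3, 4, 8, 9}`. The C++17 standard library offers `std::inclusive_scan` and `std::exclusive_scan` in `<numeric>` with the same semantics; plain loops are used here only to make the ordering of the store and the accumulate explicit.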
Author
Hao Lu