[MicroBench] Added a micro benchmark for prefix sum (#65790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65790
The benchmark compares the following prefix-sum implementations (results follow the list):
* ATen - version that calls `at::cumsum`
* NNC - a simple prefix-sum loop implemented in NNC (not vectorized)
* Local - a C++ implementation of the simple prefix-sum loop
* LocalAVX2 - a vectorized C++ implementation of prefix-sum that uses only AVX2
* LocalAVX512 - a vectorized C++ implementation of prefix-sum that uses AVX512
The vectorized implementations are based on the paper "Parallel Prefix Sum with SIMD" (ADMS '20).
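For orientation, the two loop shapes being compared can be sketched as follows. This is an illustrative sketch, not the code added in this PR: the function names `prefix_sum_scalar` and `prefix_sum_blocked` and the constant `kLanes` are hypothetical. The blocked version emulates in plain C++ the shift-and-add scan that a SIMD register of 8 floats would perform, so it shows the structure of the vectorized approach without using any intrinsics.

```cpp
#include <cstddef>
#include <vector>

// Scalar inclusive prefix sum: out[i] = in[0] + ... + in[i].
void prefix_sum_scalar(const float* in, float* out, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    acc += in[i];
    out[i] = acc;
  }
}

// Hypothetical lane count, chosen to match 8 floats in a 256-bit register.
constexpr std::size_t kLanes = 8;

// Blocked inclusive prefix sum. Each block of kLanes elements is scanned with
// log2(kLanes) shift-and-add steps (the pattern a SIMD scan uses), then the
// running total of all previous blocks is added to every lane.
void prefix_sum_blocked(const float* in, float* out, std::size_t n) {
  float carry = 0.0f;
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    float v[kLanes];
    for (std::size_t l = 0; l < kLanes; ++l) {
      v[l] = in[i + l];
    }
    for (std::size_t shift = 1; shift < kLanes; shift <<= 1) {
      float shifted[kLanes];
      // Emulates shifting the register by `shift` lanes, filling with zeros.
      for (std::size_t l = 0; l < kLanes; ++l) {
        shifted[l] = (l >= shift) ? v[l - shift] : 0.0f;
      }
      for (std::size_t l = 0; l < kLanes; ++l) {
        v[l] += shifted[l];
      }
    }
    for (std::size_t l = 0; l < kLanes; ++l) {
      out[i + l] = v[l] + carry;
    }
    carry = out[i + kLanes - 1];
  }
  // Remaining elements that do not fill a whole block are handled serially.
  for (; i < n; ++i) {
    carry += in[i];
    out[i] = carry;
  }
}
```

Note that the blocked version adds floats in a different order than the scalar loop, so results on arbitrary inputs may differ in the last bits.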
```
$ OMP_NUM_THREADS=1 ./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench
Run on (36 X 1601 MHz CPU s)
2021-09-28 23:13:12
------------------------------------------------------------------------------------------
Benchmark                                  Time         CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
PrefixSumBench/ATen/64                  1289 ns     1289 ns     543199 GB/s=397.069M/s
PrefixSumBench/ATen/256                 1867 ns     1867 ns     374232 GB/s=1096.8M/s
PrefixSumBench/ATen/1024                4169 ns     4169 ns     167889 GB/s=1.9649G/s
PrefixSumBench/ATen/4096               14137 ns    14136 ns      49266 GB/s=2.31806G/s
PrefixSumBench/ATen/16384              49887 ns    49883 ns      13988 GB/s=2.6276G/s
PrefixSumBench/ATen/65536             193742 ns   193686 ns       3628 GB/s=2.7069G/s
PrefixSumBench/ATen/262144            764803 ns   764774 ns        917 GB/s=2.74219G/s
PrefixSumBench/ATen/1048576          3040653 ns  3040277 ns        231 GB/s=2.75916G/s
PrefixSumBench/Local/64                  586 ns      586 ns    1197003 GB/s=873.244M/s
PrefixSumBench/Local/256                1077 ns     1077 ns     646265 GB/s=1.90143G/s
PrefixSumBench/Local/1024               3050 ns     3050 ns     229458 GB/s=2.68579G/s
PrefixSumBench/Local/4096              11910 ns    11910 ns      58953 GB/s=2.75132G/s
PrefixSumBench/Local/16384             43204 ns    43202 ns      16081 GB/s=3.03393G/s
PrefixSumBench/Local/65536            167966 ns   167966 ns       4154 GB/s=3.12139G/s
PrefixSumBench/Local/262144           667631 ns   667613 ns       1048 GB/s=3.14127G/s
PrefixSumBench/Local/1048576         2654785 ns  2654631 ns        264 GB/s=3.15999G/s
PrefixSumBench/NNC/64                    642 ns      642 ns    1095277 GB/s=797.442M/s
PrefixSumBench/NNC/256                  1139 ns     1138 ns     617214 GB/s=1.799G/s
PrefixSumBench/NNC/1024                 3103 ns     3103 ns     225531 GB/s=2.63979G/s
PrefixSumBench/NNC/4096                12053 ns    12052 ns      58084 GB/s=2.71883G/s
PrefixSumBench/NNC/16384               43227 ns    43225 ns      16192 GB/s=3.03231G/s
PrefixSumBench/NNC/65536              168065 ns   168056 ns       4153 GB/s=3.11972G/s
PrefixSumBench/NNC/262144             668974 ns   668921 ns       1045 GB/s=3.13513G/s
PrefixSumBench/NNC/1048576           2657464 ns  2657341 ns        263 GB/s=3.15677G/s
PrefixSumBench/LocalAVX2/64              523 ns      523 ns    1351308 GB/s=979.537M/s
PrefixSumBench/LocalAVX2/256             755 ns      755 ns     927762 GB/s=2.71159G/s
PrefixSumBench/LocalAVX2/1024           1759 ns     1759 ns     400355 GB/s=4.65609G/s
PrefixSumBench/LocalAVX2/4096           6708 ns     6706 ns     103959 GB/s=4.88649G/s
PrefixSumBench/LocalAVX2/16384         22143 ns    22142 ns      31229 GB/s=5.91951G/s
PrefixSumBench/LocalAVX2/65536         83649 ns    83642 ns       8350 GB/s=6.26828G/s
PrefixSumBench/LocalAVX2/262144       330433 ns   330427 ns       2133 GB/s=6.34679G/s
PrefixSumBench/LocalAVX2/1048576     1302301 ns  1302179 ns        537 GB/s=6.44198G/s
PrefixSumBench/LocalAVX512/64            474 ns      474 ns    1459151 GB/s=1080.8M/s
PrefixSumBench/LocalAVX512/256           576 ns      576 ns    1217442 GB/s=3.55524G/s
PrefixSumBench/LocalAVX512/1024          994 ns      994 ns     703387 GB/s=8.24434G/s
PrefixSumBench/LocalAVX512/4096         3642 ns     3641 ns     190646 GB/s=8.99857G/s
PrefixSumBench/LocalAVX512/16384       10140 ns    10140 ns      68947 GB/s=12.9267G/s
PrefixSumBench/LocalAVX512/65536       35739 ns    35736 ns      19567 GB/s=14.6711G/s
PrefixSumBench/LocalAVX512/262144     156415 ns   156413 ns       4467 GB/s=13.4078G/s
PrefixSumBench/LocalAVX512/1048576    613952 ns   613876 ns       1144 GB/s=13.665G/s
```
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D31253849
Pulled By: navahgar
fbshipit-source-id: f33e7be787c86a09e90babddd66b16e2e0777eb4