[nnc][tests] Tests and benchmarks for computeSum (#60160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60160
Adds a few simple tests and benchmarks for the `computeSum` op
(equivalent to `at::sum`).
The benchmarks cover 1D reduction and 2D row and column reduction. With the
best schedule for each shape, performance is in the ballpark of ATen
(14-15 GB/s) on my Skylake devserver, and occasionally better (e.g. the
256k * 64 row reduction goes from 9 GB/s to 13).
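For reference, the three reduction shapes can be sketched in plain C++ as
below. This is only an illustration, not the NNC or ATen implementation: the
function names are hypothetical, the data is assumed to be row-major float,
"row reduction" is assumed to sum along the inner (contiguous) dimension, and
"column reduction" along the outer one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helpers for illustration; not part of NNC or ATen.

// 1D reduction: sum of every element.
float sum1d(const std::vector<float>& a) {
  float acc = 0.f;
  for (float v : a) {
    acc += v;
  }
  return acc;
}

// 2D row reduction over a row-major M x N buffer: out[m] = sum over n of a[m][n].
std::vector<float> sumRows(const std::vector<float>& a, std::size_t M, std::size_t N) {
  assert(a.size() == M * N);
  std::vector<float> out(M, 0.f);
  for (std::size_t m = 0; m < M; ++m) {
    float acc = 0.f;
    for (std::size_t n = 0; n < N; ++n) {
      acc += a[m * N + n];  // contiguous reads within one row
    }
    out[m] = acc;
  }
  return out;
}

// 2D column reduction over a row-major M x N buffer: out[n] = sum over m of a[m][n].
std::vector<float> sumCols(const std::vector<float>& a, std::size_t M, std::size_t N) {
  assert(a.size() == M * N);
  std::vector<float> out(N, 0.f);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      out[n] += a[m * N + n];  // accumulate each row into the output vector
    }
  }
  return out;
}
```

Under these assumptions a row reduction of an M x N input yields M outputs and
a column reduction yields N outputs.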
Results (on my skylake-avx512, with turbo disabled):
```
------------------------------------------------------------------------------------------
Benchmark                                    Time            CPU  Iterations  UserCounters...
------------------------------------------------------------------------------------------
Reduce1D/Torch/16777216                4746995 ns     4746722 ns         150  BYTES=14.1379G/s
Reduce1D/Naive/16777216               34063215 ns    34061388 ns          21  BYTES=1.97023G/s
Reduce1D/NativeRfactor/16777216        5057175 ns     5057167 ns         139  BYTES=13.2701G/s
Reduce1D/TeNaive/16777216             33868945 ns    33868851 ns          21  BYTES=1.98143G/s
Reduce1D/TeSplitTail/16777216         33902786 ns    33900436 ns          21  BYTES=1.97959G/s
Reduce1D/TeSplitMask/16777216         33922509 ns    33920604 ns          21  BYTES=1.97841G/s
Reduce1D/TeRfactorV1/16777216          5141150 ns     5141002 ns         135  BYTES=13.0537G/s
Reduce1D/Op/16777216                   5140390 ns     5140091 ns         135  BYTES=13.056G/s
Reduce2DCol/Torch/8/2097152           12824403 ns    12823563 ns          55  BYTES=5.8874G/s
Reduce2DCol/Torch/64/262144            8306873 ns     8306743 ns          83  BYTES=8.20507G/s
Reduce2DCol/Torch/4096/4096            7992364 ns     7992239 ns          87  BYTES=8.3988G/s
Reduce2DCol/OpSchedule/8/2097152/0     4866144 ns     4865766 ns         138  BYTES=15.5161G/s
Reduce2DCol/OpSchedule/64/262144/0    36668978 ns    36666415 ns          19  BYTES=1.85885G/s
Reduce2DCol/OpSchedule/4096/4096/0   155862459 ns   155801266 ns           4  BYTES=430.839M/s
Reduce2DCol/OpSchedule/8/2097152/1     8067683 ns     8061117 ns          85  BYTES=9.36563G/s
Reduce2DCol/OpSchedule/64/262144/1     7496686 ns     7496562 ns          93  BYTES=9.09183G/s
Reduce2DCol/OpSchedule/4096/4096/1     5262821 ns     5262186 ns         131  BYTES=12.7562G/s
Reduce2DCol/OpSchedule/8/2097152/2     6237899 ns     6237210 ns         109  BYTES=12.1044G/s
Reduce2DCol/OpSchedule/64/262144/2     5258012 ns     5257655 ns         127  BYTES=12.9635G/s
Reduce2DCol/OpSchedule/4096/4096/2     5231686 ns     5228241 ns         132  BYTES=12.839G/s
Reduce2DCol/OpSchedule/8/2097152/3    11088573 ns    11087557 ns          62  BYTES=6.80921G/s
Reduce2DCol/OpSchedule/64/262144/3     5338843 ns     5338326 ns         127  BYTES=12.7676G/s
Reduce2DCol/OpSchedule/4096/4096/3     4311617 ns     4308102 ns         162  BYTES=15.5812G/s
Reduce2DRow/Torch/8/2097152            4642244 ns     4641794 ns         151  BYTES=14.4575G/s
Reduce2DRow/Torch/64/262144            4628311 ns     4628245 ns         151  BYTES=14.4999G/s
Reduce2DRow/Torch/4096/4096            4894012 ns     4893316 ns         143  BYTES=13.7177G/s
Reduce2DRow/Torch/262144/64           10469098 ns    10468027 ns          68  BYTES=6.51101G/s
Reduce2DRow/Hand/262144/64             5554380 ns     5554059 ns         126  BYTES=12.2716G/s
Reduce2DRow/OpSchedule/8/2097152/0    33890363 ns    33888931 ns          21  BYTES=1.98026G/s
Reduce2DRow/OpSchedule/64/262144/0    33901317 ns    33899436 ns          21  BYTES=1.97965G/s
Reduce2DRow/OpSchedule/4096/4096/0    33500358 ns    33498815 ns          21  BYTES=2.00381G/s
Reduce2DRow/OpSchedule/262144/64/0    13132231 ns    13131049 ns          53  BYTES=5.19056G/s
Reduce2DRow/OpSchedule/8/2097152/1     5200423 ns     5200025 ns         134  BYTES=12.9055G/s
Reduce2DRow/OpSchedule/64/262144/1     5204428 ns     5204327 ns         133  BYTES=12.8949G/s
Reduce2DRow/OpSchedule/4096/4096/1     8724355 ns     8723370 ns          80  BYTES=7.69488G/s
Reduce2DRow/OpSchedule/262144/64/1  1811861280 ns  1811352083 ns           1  BYTES=37.6279M/s
Reduce2DRow/OpSchedule/8/2097152/2     9169829 ns     9168946 ns          76  BYTES=7.31915G/s
Reduce2DRow/OpSchedule/64/262144/2     9159901 ns     9158560 ns          76  BYTES=7.32747G/s
Reduce2DRow/OpSchedule/4096/4096/2     9217398 ns     9215557 ns          76  BYTES=7.28391G/s
Reduce2DRow/OpSchedule/262144/64/2    10820450 ns    10818998 ns          66  BYTES=6.29979G/s
Reduce2DRow/OpSchedule/8/2097152/3     5227921 ns     5226544 ns         133  BYTES=12.84G/s
Reduce2DRow/OpSchedule/64/262144/3     5194362 ns     5194082 ns         133  BYTES=12.9203G/s
Reduce2DRow/OpSchedule/4096/4096/3     5196080 ns     5195349 ns         134  BYTES=12.9203G/s
Reduce2DRow/OpSchedule/262144/64/3     5235189 ns     5234728 ns         133  BYTES=13.0202G/s
```
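A note on reading the BYTES counter: the values above appear consistent with
(input bytes + output bytes) divided by CPU time, with 4-byte float elements.
That is an inference from the numbers, not something stated in the benchmark
source, and the helper name below is hypothetical. A small sketch of the
arithmetic:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: estimates the BYTES rate in GB/s (1e9 bytes per second).
// Assumes float elements and that both the input and the output are counted.
double bytesRateGBps(std::size_t inElems, std::size_t outElems, double cpuNs) {
  assert(cpuNs > 0.0);
  const double bytes = static_cast<double>(inElems + outElems) * sizeof(float);
  return bytes / cpuNs;  // bytes per nanosecond is numerically equal to GB/s
}
```

With the CPU times from the table, `bytesRateGBps(16777216, 1, 4746722)` comes
out at roughly 14.138, in line with `Reduce1D/Torch/16777216`, and
`bytesRateGBps(16777216, 262144, 10468027)` at roughly 6.511, in line with
`Reduce2DRow/Torch/262144/64`.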
ghstack-source-id: 131753875
Test Plan: these tests
Reviewed By: navahgar
Differential Revision: D29190420
fbshipit-source-id: 86246df82098da4f5493d6c4f34a40016d95a9f0