Target 4096 blocks instead of splitting to a large grid for large reductions (#35997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35997
When the number of blocks is large enough, we already achieve
balanced SM allocation. But we should still keep the number of inputs
per thread large, because per-thread reduction is cheap.
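A minimal sketch of this launch heuristic, assuming a simple 1D reduction: cap the grid at roughly 4096 blocks, then fold the remaining work into serial per-thread reduction. The names `kTargetBlocks`, `LaunchConfig`, and `pick_config` are hypothetical illustrations, not the actual TensorIterator reduction code:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical constant: enough blocks to keep all SMs busy on a V100,
// beyond which adding more blocks no longer improves occupancy balance.
constexpr int64_t kTargetBlocks = 4096;

struct LaunchConfig {
  int64_t grid;               // number of blocks to launch
  int64_t inputs_per_thread;  // serial reduction work per thread
};

LaunchConfig pick_config(int64_t num_inputs, int64_t block_size) {
  // Naive split: one input per thread, grid grows with input size.
  int64_t grid = (num_inputs + block_size - 1) / block_size;
  // Once the grid is large enough for balanced SM allocation, stop
  // growing it; give each thread more inputs instead, since the
  // per-thread serial reduction is cheap.
  grid = std::min(grid, kTargetBlocks);
  int64_t inputs_per_thread =
      (num_inputs + grid * block_size - 1) / (grid * block_size);
  return {grid, inputs_per_thread};
}
```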
Benchmark for Half on V100:
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark.ipynb
On a large tensor, it is 1.37ms vs 1.25ms.
Test Plan: Imported from OSS
Differential Revision: D20927533
Pulled By: ngimel
fbshipit-source-id: 40df52e439cc1c01cda66c6195b600f301c5e984