Refactor thread_reduce for better unrolling and vectorization in the future (#36014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36014
Benchmark on RTX 2080 Ti: 2.13 ms vs 1.88 ms
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark-refactor.ipynb
Test Plan: Imported from OSS
Differential Revision: D20927535
Pulled By: ngimel
fbshipit-source-id: b65b749b58cebe0751e4ec7e1cf359543c401580