move parallel_for/parallel_reduce common implementation to cpp (#26969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26969
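The parallel_for/parallel_reduce templates were instantiated at many call
sites, which inflated binary size. This PR extracts the common implementation
that doesn't depend on the template parameters and moves it into a cpp file.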
Size before -> after:
Compressed ARMv7 AAR size: 5,677,469 -> 5,398,011
RAW libpytorch.so size: 16,862,108 -> 16,047,004
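Illustrative sketch of the pattern described in the summary, not the actual ATen code. The names `sketch::run_chunks` and `sketch::demo_sum_of_squares` are hypothetical, and the chunks run sequentially here to keep the example dependency-free (the real implementation dispatches to a thread pool). The point is that the loop logic is compiled once behind a `std::function`, and only a thin forwarding wrapper remains templated:
```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

namespace sketch {

// Non-template part (hypothetical name). In a real project this definition
// would live in a .cpp file, so it is compiled once instead of once per
// template instantiation. Assumption: chunks are run sequentially here;
// a real implementation would hand them to a thread pool.
inline void run_chunks(int64_t begin, int64_t end, int64_t grain_size,
                       const std::function<void(int64_t, int64_t)>& task) {
  if (begin >= end) {
    return;
  }
  grain_size = std::max<int64_t>(grain_size, 1);
  for (int64_t start = begin; start < end; start += grain_size) {
    const int64_t stop = std::min(start + grain_size, end);
    task(start, stop);  // process the half-open range [start, stop)
  }
}

// Template part: stays in the header, but is now only a thin wrapper that
// type-erases the callable and forwards to the shared implementation.
template <class F>
inline void parallel_for(int64_t begin, int64_t end, int64_t grain_size,
                         const F& f) {
  run_chunks(begin, end, grain_size,
             [&f](int64_t b, int64_t e) { f(b, e); });
}

// Small usage helper (hypothetical): fills out[i] = i * i in chunks of 4
// and returns the sum of the results.
inline int64_t demo_sum_of_squares(int64_t n) {
  assert(n >= 0);
  std::vector<int64_t> out(static_cast<size_t>(n), 0);
  parallel_for(0, n, 4, [&out](int64_t b, int64_t e) {
    for (int64_t i = b; i < e; ++i) {
      out[static_cast<size_t>(i)] = i * i;
    }
  });
  return std::accumulate(out.begin(), out.end(), int64_t{0});
}

}  // namespace sketch
```
The trade-off in this sketch is one `std::function` indirection per chunk in exchange for a single shared copy of the loop logic.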
Test Plan:
- Test perf/correctness as in #26702;
- Run tests for non-mobile build with native ATen threading:
```
ATEN_THREADING=NATIVE python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
pytest -s -v test/test_jit.py
```
Differential Revision: D17628089
Pulled By: ljk53
fbshipit-source-id: 987d1f28174870384d6642d0bd4912b138348f66