In loop_wrapper, do not copy the passed-in functor (capture it by reference instead). (#15845)
Summary:
The overhead of the copy makes an appreciable difference when doing many small reductions (i.e., when the reduced dimension is significantly smaller than the non-reduced dimensions).
```
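# Run in an IPython session (%timeit is an IPython magic).
import torch
x = torch.randn((1024, 10, 1024), dtype=torch.float64)
torch.set_num_threads(1)  # single-threaded, so per-reduction overhead is not hidden
%timeit x.std(1)          # reduce over the short dimension (size 10)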
```
Before: 813.0 ms
After: 708.25 ms
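A minimal C++ sketch of the difference between the two capture modes. This is not the actual loop_wrapper code: `SumOp`, `wrap_by_copy`, `wrap_by_ref`, and `count_copies` are hypothetical names introduced only for illustration, and the functor counts its own copies so the effect is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical copy counter, used only to make the copies visible.
static int g_copies = 0;

// Hypothetical stand-in for a reduction functor passed to a loop wrapper.
struct SumOp {
  SumOp() = default;
  SumOp(const SumOp&) { ++g_copies; }  // record every copy
  double operator()(const double* data, std::size_t n) const {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
  }
};

// By-value capture: each wrapper built this way holds its own copy of op.
template <typename F>
auto wrap_by_copy(const F& op) {
  return [op](const double* data, std::size_t n) { return op(data, n); };
}

// By-reference capture: no copy is made, but op must outlive the wrapper.
template <typename F>
auto wrap_by_ref(const F& op) {
  return [&op](const double* data, std::size_t n) { return op(data, n); };
}

// Builds one wrapper per small reduction and returns how many functor
// copies were made in total.
template <typename MakeWrapper>
int count_copies(MakeWrapper make, std::size_t reductions) {
  SumOp op;
  const std::vector<double> row(10, 1.0);  // short reduced dimension
  g_copies = 0;
  for (std::size_t i = 0; i < reductions; ++i) {
    auto loop = make(op);
    assert(loop(row.data(), row.size()) == 10.0);
  }
  return g_copies;
}
```

With by-value capture the functor is copied at least once per wrapper, so the cost grows with the number of reductions; with by-reference capture it is never copied. The trade-off is lifetime: the by-reference wrapper is only safe while the original functor is still alive.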
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15845
Differential Revision: D13603246
Pulled By: umanwizard
fbshipit-source-id: 020d224d76fcb8a0b55b75b0f2937e9508891beb