TensorIterator: Avoid nesting two levels of function_ref in for_each (#53613)
Summary:
When `TensorIterator::for_each` is called with a 1d loop, it creates a `function_ref` for the 1d iteration, then `LOOP_WRAPPER` transforms that into a 2d loop, and the resulting 2d loop is wrapped in a second `function_ref`. This double indirection can add significant overhead when the 1d inner loop covers only a small number of elements.
Instead, this change wraps the 1d loop before type erasure, so only one level of `function_ref` is introduced. A simple benchmark demonstrates this is a win:
```python
import torch
a = torch.rand((10000, 2))[::2]
%timeit a + a
```
Note that the `[::2]` slice makes the tensor non-contiguous, so the 2d iteration cannot be coalesced into 1d, and both `cpu_kernel` and `cpu_kernel_vec` use the 1d `for_each`. On master this takes 42 us; with this change it is down to 32 us.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53613
Reviewed By: VitalyFedyunin
Differential Revision: D26947143
Pulled By: ezyang
fbshipit-source-id: 5189ada0d82bbf74170fb446763753f02478abf6