cpu_kernel_vec: Hoist stride checks out of loop (#68962)
Summary:
`cpu_kernel_vec` does stride checks to determine whether to use the vectorized or scalar inner loop. Since it uses a 1d `for_each` loop, it re-does these stride checks after every loop over the inner dimension. For iterators with small inner dimensions, this means a significant proportion of the time may be spent just on stride checks.
This changes it to use a 2d loop so the stride checks are further amortized. With the `copy_` benchmark below, it roughly halves the callgrind instruction count, from 28.4 million to 13.5 million, and cuts the runtime by about 30%, from 22.8 us to 16.4 us, on my machine.
```
from torch.utils.benchmark import Timer
import timeit
timer = Timer(
    stmt="b.copy_(a);",
    setup="""
        auto a = at::rand({10000, 8}, at::kComplexDouble).slice(0, 0, -1, 2);
        auto b = at::empty_like(a);
    """,
    num_threads=1,
    language='c++',
    timer=timeit.default_timer
)
```
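To illustrate why hoisting helps, here is a toy Python model of the two loop styles. It is only a sketch: `pick_inner_loop`, `run_1d`, and `run_2d` are hypothetical names, not PyTorch APIs, and the model counts stride checks rather than doing real work. The sizes assume the benchmark tensor above reduces to an inner dimension of 8 and an outer dimension of 5000; in the real iterator a 2d loop may be invoked over several blocks, so the check count would be small rather than exactly one.
```
# Toy model only: these helpers are hypothetical and are not PyTorch APIs.

def pick_inner_loop(strides, counter):
    """Stand-in for the stride check that picks the vectorized or scalar path."""
    counter["checks"] += 1
    # Element strides here; the real code compares byte strides.
    return "vectorized" if all(s == 1 for s in strides) else "scalar"

def run_1d(size0, size1, strides):
    """1d style: the check is repeated for every pass over the inner dimension."""
    counter = {"checks": 0, "elements": 0}
    for _ in range(size1):                 # outer dimension
        pick_inner_loop(strides, counter)  # re-checked on each pass
        counter["elements"] += size0       # stands in for the inner loop body
    return counter

def run_2d(size0, size1, strides):
    """2d style: the check is hoisted and done once for the whole 2d block."""
    counter = {"checks": 0, "elements": 0}
    pick_inner_loop(strides, counter)      # checked once, outside the loop
    for _ in range(size1):
        counter["elements"] += size0
    return counter

# Assumed shape: inner dimension 8, outer dimension 5000.
one_d = run_1d(8, 5000, (1, 1))
two_d = run_2d(8, 5000, (1, 1))
print(one_d["checks"], two_d["checks"])
```
Both variants process the same number of elements; only the number of stride checks differs, which is why the saving is largest when the inner dimension is small.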
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68962
Reviewed By: mrshenli
Differential Revision: D32684191
Pulled By: ngimel
fbshipit-source-id: 582af038314a0f999f43669e66edace38ff8d2dc