Improve performance for unary kernels using vml (#91963)
This gives some speedups for kernels implemented with `at::vml`:
- Make vml ops serial and use `TensorIterator.for_each` for better parallelism
with discontiguous tensors
- Reduce the buffer size for discontiguous data to 8 KiB, which is more likely
to fit in the L1d cache while remaining wide enough to utilize AVX-512
- Avoid a copy when only one of the input and output is discontiguous
There is no change for contiguous tensors, but I see a significant speedup on
the following benchmarks:
```
import torch
a = torch.randn(2*10**6, device="cpu")
%timeit a.view(100, 20000)[:,::2].sqrt()
%timeit a.view(200, 10000)[::2].sqrt()
```
For a discontiguous last dimension I see a 27x speedup, and for a discontiguous
batch dimension an 8x speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91963
Approved by: https://github.com/jgong5