Vectorize cpu tensor conversions (#80905)
This adds vectorization to the CPU copy kernel for copies between
different dtypes by using `at::vec::convert`. Currently `vec::convert`
falls back to a scalar copy loop for most dtypes; however, the compiler
is still better able to auto-vectorize that loop since it doesn't
involve stride calculations.
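As a rough illustration of the point about stride calculations, here is a
minimal standalone sketch contrasting the two loop shapes. It is not the
ATen code: `convert_strided` and `convert_contiguous` are hypothetical
helpers written for this example, and whether a given compiler actually
vectorizes either loop depends on the compiler, flags, and dtypes involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical strided loop: each element's address is computed from a
// runtime byte stride, so the compiler cannot assume unit-stride access.
template <typename dst_t, typename src_t>
void convert_strided(const char* src, char* dst, std::ptrdiff_t n,
                     std::ptrdiff_t src_stride, std::ptrdiff_t dst_stride) {
  assert(n >= 0);
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    const src_t* in = reinterpret_cast<const src_t*>(src + i * src_stride);
    dst_t* out = reinterpret_cast<dst_t*>(dst + i * dst_stride);
    *out = static_cast<dst_t>(*in);
  }
}

// Hypothetical contiguous loop: a plain scalar conversion over typed
// pointers with no stride arithmetic. This is the shape that a scalar
// fallback would have, and it is typically easier to auto-vectorize.
template <typename dst_t, typename src_t>
void convert_contiguous(const src_t* src, dst_t* dst, std::ptrdiff_t n) {
  assert(n >= 0);
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    dst[i] = static_cast<dst_t>(src[i]);
  }
}
```

Both helpers produce the same values for contiguous data; the difference
is only in how much the compiler knows about the memory layout.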
In a simple timeit benchmark, I see around a 2x speedup when copying
from int32 to various dtypes:
| To dtype | Master (us) | This PR (us) |
|----------|-------------|--------------|
| int64 | 23.8 | 10.3 |
| float32 | 16.8 | 8.18 |
| float64 | 18.0 | 9.47 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80905
Approved by: https://github.com/ngimel