Vectorize conversions to BFloat16 on CPU (#80906)
This adds explicit vectorization for converting float or double to
bfloat16. Most dtype conversions are handled well by the
auto-vectorizer, but these are not, presumably because of the
branching in the scalar conversion code.
Benchmark results with 512K elements on an AVX2 machine:
| conversion | Before (us) | After (us) |
|---------------------|-------------|------------|
| float32 -> bfloat16 | 53.3 | 39.8 |
| float64 -> bfloat16 | 92.1 | 78.2 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80906
Approved by: https://github.com/ngimel