Optimize the copy of BFloat16 to Float and Float to BFloat16 (#79685)
* Vectorize the copy of BFloat16 <-> Float
* Use `at::internal::serial_for_each` instead of `cpu_kernel_vec` directly, since `cpu_kernel_vec` can't handle cases where the input and output have different data types.
Single socket (28 cores):
```
before: torch.Size([10, 128, 10, 124]) bf16 -> fp32: 4.18e-05 ms; fp32 -> bf16: 5.04e-05 ms
        torch.Size([10, 128, 30, 124]) bf16 -> fp32: 0.00011868 ms; fp32 -> bf16: 0.0001476 ms
after:  torch.Size([10, 128, 10, 124]) bf16 -> fp32: 1.35e-05 ms; fp32 -> bf16: 1.97e-05 ms
        torch.Size([10, 128, 30, 124]) bf16 -> fp32: 7.32e-05 ms; fp32 -> bf16: 5.70e-05 ms
```
Single core:
```
before: torch.Size([10, 128, 10, 124]) bf16 -> fp32: 0.000848 ms; fp32 -> bf16: 0.00105 ms
        torch.Size([10, 128, 30, 124]) bf16 -> fp32: 0.00269 ms; fp32 -> bf16: 0.00321 ms
after:  torch.Size([10, 128, 10, 124]) bf16 -> fp32: 0.000370 ms; fp32 -> bf16: 0.000382 ms
        torch.Size([10, 128, 30, 124]) bf16 -> fp32: 0.00153 ms; fp32 -> bf16: 0.00113 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79685
Approved by: https://github.com/malfet