enable bf16 vec copy (#54671)
Summary:
Enable BFloat16 vectorized copy.
As expected, BFloat16 copy is about 2x faster than fp32 copy, since it moves half the bytes.
In the op benchmark, the BFloat16 vectorized copy does not show a gain over the scalar version. This is likely because copy is memory-bound: even when the code is written in scalar form, the memory system does not actually transfer one scalar at a time, so the scalar version already saturates memory bandwidth.
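The ~2x expectation over fp32 follows from element width: a BFloat16 tensor occupies half the bytes of a float32 tensor of the same shape, so a memory-bound copy moves half the data. A minimal check of the element sizes (assuming PyTorch is installed):

```
import torch

# BFloat16 stores 2 bytes per element vs. 4 bytes for float32,
# so a memory-bound copy of a bf16 tensor moves half the bytes.
fp32 = torch.rand(1024)
bf16 = fp32.bfloat16()
print(fp32.element_size())  # 4
print(bf16.element_size())  # 2
```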
Benchmark code:
```
import torch
import torch.utils.benchmark as benchmark
# x = torch.empty(10 * 18304 * 1024 * 16, dtype=torch.bfloat16)
x = torch.empty(10 * 18304 * 1024 * 16, dtype=torch.float)
def copy(tensors):
    for t in tensors:
        x.copy_(t)

tensors = []
for i in range(2):
    # l3 cache size 36608k = 18304 bfloat16 * 2 byte(per bfloat16)
    # tensors.append(torch.rand(10 * 18304 * 1024 * 16).bfloat16())
    tensors.append(torch.rand(10 * 18304 * 1024 * 16))

t0 = benchmark.Timer(
    stmt='copy(tensors)',
    setup='from __main__ import copy',
    globals={'tensors': tensors},
    num_threads=1)
print(t0.timeit(20))
```
Before this commit:
fp32: 3.84 s (1 measurement, 20 runs, 1 thread)
bf16: 1.89 s (1 measurement, 20 runs, 1 thread)
After:
fp32: 3.71 s (1 measurement, 20 runs, 1 thread)
bf16: 1.85 s (1 measurement, 20 runs, 1 thread)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54671
Reviewed By: ailzhang
Differential Revision: D27325350
Pulled By: heitorschueroff
fbshipit-source-id: 1a3b8ca17b4c60dbb3e86bf196f63e0a05228c65