enable bf16 vec copy (#54671)
Summary:
Enable BFloat16 vectorized copy.
As expected, BFloat16 copy is about 2x faster than fp32 copy, since it moves half the bytes.
In the op benchmark, the BFloat16 vectorized copy does not show a gain over the scalar version. This is likely because copy is memory-bound: even when the code is written in scalar form, the memory system does not actually transfer one scalar at a time, so the scalar version already saturates memory bandwidth.
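The ~2x expectation over fp32 follows from element width: a BFloat16 tensor occupies half the bytes of a float32 tensor of the same shape, so a memory-bound copy moves half the data. A minimal check of the element sizes (assuming PyTorch is installed):

```
import torch

# BFloat16 stores 2 bytes per element vs. 4 bytes for float32,
# so a memory-bound copy of a bf16 tensor moves half the bytes.
fp32 = torch.rand(1024)
bf16 = fp32.bfloat16()
print(fp32.element_size())  # 4
print(bf16.element_size())  # 2
```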
Benchmark code:
```
import torch
import torch.utils.benchmark as benchmark
# x = torch.empty(10 * 18304 * 1024 * 16, dtype=torch.bfloat16)
x = torch.empty(10 * 18304 * 1024 * 16, dtype=torch.float)
def copy(tensors):
    for t in tensors:
        x.copy_(t)

tensors = []
for i in range(2):
    # l3 cache size 36608k = 18304 bfloat16 * 2 byte(per bfloat16)
    # tensors.append(torch.rand(10 * 18304 * 1024 * 16).bfloat16())
    tensors.append(torch.rand(10 * 18304 * 1024 * 16))

t0 = benchmark.Timer(
    stmt='copy(tensors)',
    setup='from __main__ import copy',
    globals={'tensors': tensors},
    num_threads=1)
print(t0.timeit(20))
```
Before this commit:
fp32: 3.84 s (1 measurement, 20 runs, 1 thread)
bf16: 1.89 s (1 measurement, 20 runs, 1 thread)
After:
fp32: 3.71 s (1 measurement, 20 runs, 1 thread)
bf16: 1.85 s (1 measurement, 20 runs, 1 thread)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54671
Reviewed By: ailzhang
Differential Revision: D27325350
Pulled By: heitorschueroff
fbshipit-source-id: 1a3b8ca17b4c60dbb3e86bf196f63e0a05228c65