Add BFloat16 support for topk on CPU (#59547)
Summary:
Added BFloat16 support for topk on CPU. Benchmark data for topk with the BFloat16 and Float32 data types was collected using PyTorch's operator_benchmark tool on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz, with the following configuration:
Input: 512x512, 512x1024, 1024x512, 1024x1024
K: 5
Number of cores: 1 core, 28 cores (1 socket)
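For orientation, the call being benchmarked can be sketched as follows. This is a minimal illustration rather than the benchmark script itself: it assumes a PyTorch build that includes this change, and the random input, seed, and variable names are illustrative. It runs topk directly on a BFloat16 CPU tensor and checks the returned values against a Float32 run on the same (upcast) data.

```python
import torch

torch.manual_seed(0)

# Same shape and k as the smallest benchmark case; the random input is illustrative.
x_fp32 = torch.randn(512, 512)
x_bf16 = x_fp32.to(torch.bfloat16)

# topk along the last dimension, directly on the BFloat16 CPU tensor.
values, indices = torch.topk(x_bf16, k=5)

# Reference: upcast the same BFloat16 data to Float32 and run topk there.
# The upcast is exact, so the top-k values should agree.
ref_values, _ = torch.topk(x_bf16.float(), k=5)
values_match = torch.equal(values.float(), ref_values)

print(values.dtype, tuple(values.shape), indices.dtype, values_match)
```

Only the values are compared, since indices may legitimately differ when the input contains ties.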
For 1 core:
----------------------------------------
PyTorch/Caffe2 Operator Micro-benchmarks
----------------------------------------
Tag : all
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H512_W512_k5_dtypetorch.float32_cpu
Input: H: 512, W: 512, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 911.401
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H512_W512_k5_dtypetorch.bfloat16_cpu
Input: H: 512, W: 512, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 911.700
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H512_W1024_k5_dtypetorch.float32_cpu
Input: H: 512, W: 1024, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 1506.927
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H512_W1024_k5_dtypetorch.bfloat16_cpu
Input: H: 512, W: 1024, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 1492.036
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H1024_W512_k5_dtypetorch.float32_cpu
Input: H: 1024, W: 512, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 1825.634
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H1024_W512_k5_dtypetorch.bfloat16_cpu
Input: H: 1024, W: 512, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 1819.872
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H1024_W1024_k5_dtypetorch.float32_cpu
Input: H: 1024, W: 1024, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 3001.459
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H1024_W1024_k5_dtypetorch.bfloat16_cpu
Input: H: 1024, W: 1024, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 2970.718
For 28 cores (1 socket):
----------------------------------------
PyTorch/Caffe2 Operator Micro-benchmarks
----------------------------------------
Tag : all
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H512_W512_k5_dtypetorch.float32_cpu
Input: H: 512, W: 512, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 146.995
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H512_W512_k5_dtypetorch.bfloat16_cpu
Input: H: 512, W: 512, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 123.423
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H512_W1024_k5_dtypetorch.float32_cpu
Input: H: 512, W: 1024, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 105.967
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H512_W1024_k5_dtypetorch.bfloat16_cpu
Input: H: 512, W: 1024, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 101.498
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H1024_W512_k5_dtypetorch.float32_cpu
Input: H: 1024, W: 512, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 128.023
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H1024_W512_k5_dtypetorch.bfloat16_cpu
Input: H: 1024, W: 512, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 125.172
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H1024_W1024_k5_dtypetorch.float32_cpu
Input: H: 1024, W: 1024, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 129.855
Benchmarking PyTorch: topk
Mode: Eager
Name: topk_H1024_W1024_k5_dtypetorch.bfloat16_cpu
Input: H: 1024, W: 1024, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 124.556
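The timings above are easier to compare as BFloat16/Float32 ratios. The snippet below is a convenience sketch only: the numbers are copied by hand from the logs above, and the names `timings_us` and `ratios` are illustrative, not part of the benchmark tool. A ratio below 1.0 means the BFloat16 run was faster.

```python
# Forward execution times in microseconds, copied from the logs above.
# Key: (core setting, H, W); value: (float32 time, bfloat16 time).
timings_us = {
    ("1 core", 512, 512): (911.401, 911.700),
    ("1 core", 512, 1024): (1506.927, 1492.036),
    ("1 core", 1024, 512): (1825.634, 1819.872),
    ("1 core", 1024, 1024): (3001.459, 2970.718),
    ("28 cores", 512, 512): (146.995, 123.423),
    ("28 cores", 512, 1024): (105.967, 101.498),
    ("28 cores", 1024, 512): (128.023, 125.172),
    ("28 cores", 1024, 1024): (129.855, 124.556),
}

# Ratio of BFloat16 time to Float32 time for each configuration.
ratios = {key: bf16 / fp32 for key, (fp32, bf16) in timings_us.items()}

for (cores, h, w), ratio in ratios.items():
    print(f"{cores:>8}  {h}x{w}: bfloat16/float32 = {ratio:.3f}")
```

These are single reported measurements per configuration, so small differences between the two data types should not be over-interpreted.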
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59547
Reviewed By: mrshenli
Differential Revision: D29763916
Pulled By: ezyang
fbshipit-source-id: 706c7d4349ac9ebd5d63f4844fca70febcb67023