pytorch
20f52cdd - [hpc]optimize the torch.cat cuda kernel (#44833)

[hpc]optimize the torch.cat cuda kernel (#44833)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44833

The current `cat` CUDA kernel stages the input tensor metadata in pinned host memory. This has two drawbacks: 1) it is much slower than passing the metadata as a kernel argument, which is delivered through constant memory, and 2) the resulting H2D copy sometimes overlaps with other H2D copies during training, which introduces random delays and can lead to desync issues. For small N, we actually saw 2X improvements.

Test Plan: benchmark
```
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter all --device cuda
```

PyTorch/Caffe2 Operator Micro-benchmarks (Tag: all, Mode: Eager, device: cuda). Forward Execution Time in microseconds, before and after the optimization; `<lambda>` entries are randomly generated sizes from the benchmark config:

| sizes | N | dim | Before (us) | After (us) |
|---|---|---|---|---|
| (1, 1, 1) | 2 | 0 | 38.825 | 22.168 |
| (512, 512, 2) | 2 | 1 | 45.440 | 33.430 |
| (128, 1024, 2) | 2 | 1 | 38.765 | 19.884 |
| (1024, 1024, 2) | 2 | 0 | 60.075 | 48.082 |
| (1025, 1023, 2) | 2 | 1 | 65.203 | 53.261 |
| (1024, 1024, 2) | 2 | 2 | 83.941 | 71.294 |
| [`<lambda>`, 111, 65] | 5 | 0 | 51.059 | 40.165 |
| [96, `<lambda>`, 64] | 5 | 1 | 42.134 | 32.666 |
| [128, 64, `<lambda>`] | 5 | 2 | 78.333 | 67.003 |
| [`<lambda>`, 32, 64] | 50 | 0 | 77.065 | 67.035 |
| [32, `<lambda>`, 64] | 50 | 1 | 74.632 | 63.803 |
| [33, 65, `<lambda>`] | 50 | 2 | 81.846 | 69.969 |
| (64, 32, 4, 16, 32) | 2 | 2 | 99.291 | 98.327 |
| (16, 32, 4, 16, 32) | 8 | 2 | 114.060 | 112.363 |
| (9, 31, 5, 15, 33) | 17 | 4 | 478.777 | 478.224 |
| [`<lambda>`] | 100 | 0 | 80.165 | 63.269 |
| [`<lambda>`] | 1000 | 0 | 491.983 | 470.141 |
| [`<lambda>`] | 2000 | 0 | 966.613 | 966.668 |
| [`<lambda>`] | 3000 | 0 | 1500.133 | 1485.309 |

Reviewed By: ngimel

Differential Revision: D23727275

fbshipit-source-id: 171275ac541c649f7aeab0a2f8f0fea9486d0180
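The optimization described above relies on the fact that CUDA kernel arguments are passed by value and delivered to the device through constant memory, so packing the per-input metadata into a fixed-capacity struct and passing it directly to the kernel removes the pinned-memory staging buffer and its extra H2D copy. A minimal sketch of the pattern follows; the struct layout, names, batch cap, and the assumption of contiguous inputs concatenated along dim 0 are all illustrative, not PyTorch's actual implementation:

```cuda
// Hypothetical cap on inputs handled per launch; chosen so the struct stays
// well under the CUDA kernel-parameter size limit (4 KB on older archs).
#define CAT_BATCH_SIZE 64

// Per-launch metadata. Passed BY VALUE as a kernel argument, so it travels
// in constant memory with the launch -- no separate cudaMemcpy needed.
template <typename T, int n>
struct CatInputs {
  const T* data[n];      // device pointer to each input tensor
  int64_t offset[n];     // element offset of each input in the output
  int64_t nElements[n];  // element count of each input
};

template <typename T, int n>
__global__ void cat_kernel(T* output, CatInputs<T, n> inputs, int batch) {
  // Grid-stride over inputs (y) and over elements within an input (x).
  // Assumes contiguous inputs concatenated along dim 0, so each input is
  // copied as a flat range starting at its precomputed offset.
  for (int b = blockIdx.y; b < batch; b += gridDim.y) {
    const T* src = inputs.data[b];
    T* dst = output + inputs.offset[b];
    for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < inputs.nElements[b];
         i += (int64_t)gridDim.x * blockDim.x) {
      dst[i] = src[i];
    }
  }
}
```

The pinned-memory approach this replaces would instead fill a host-side array of pointers/offsets and `cudaMemcpyAsync` it to a device buffer before every launch; that copy is what could collide with other H2D traffic on the copy engine and cause the random delays mentioned in the summary.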