[ROCm] revert cat operator performance work-around
revert d5ca53c9554fd63d1fd69e58416dbf17a7952af9 (#46097). The changes only affect ROCm. Reverts a work-around for a compiler performance issue that is no longer needed.
`python -m pt.cat_test --tag_filter all --device cuda`
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 48.833
NEW Forward Execution Time (us) : 8.318
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 54.508
NEW Forward Execution Time (us) : 23.824
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.117
NEW Forward Execution Time (us) : 14.942
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.790
NEW Forward Execution Time (us) : 74.334
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 102.063
NEW Forward Execution Time (us) : 76.008
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 167.786
NEW Forward Execution Time (us) : 123.679
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1b1dec7b00>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1b1dec7b00>, 111, 65], N: 5, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.320
NEW Forward Execution Time (us) : 67.436
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1b1dec7a70>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1b1dec7a70>, 64], N: 5, dim: 1, device: cuda
OLD Forward Execution Time (us) : 91.484
NEW Forward Execution Time (us) : 59.230
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f18db09d290>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f18db09d290>], N: 5, dim: 2, device: cuda
OLD Forward Execution Time (us) : 109.569
NEW Forward Execution Time (us) : 76.557
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d560>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d560>, 32, 64], N: 50, dim: 0, device: cuda
OLD Forward Execution Time (us) : 106.603
NEW Forward Execution Time (us) : 87.635
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f18db09d5f0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f18db09d5f0>, 64], N: 50, dim: 1, device: cuda
OLD Forward Execution Time (us) : 106.693
NEW Forward Execution Time (us) : 88.902
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f18db09d680>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f18db09d680>], N: 50, dim: 2, device: cuda
OLD Forward Execution Time (us) : 110.881
NEW Forward Execution Time (us) : 94.361
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 122.925
NEW Forward Execution Time (us) : 123.046
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
OLD Forward Execution Time (us) : 272.442
NEW Forward Execution Time (us) : 271.932
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
OLD Forward Execution Time (us) : 457.329
NEW Forward Execution Time (us) : 456.767
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d710>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d710>], N: 100, dim: 0, device: cuda
OLD Forward Execution Time (us) : 117.688
NEW Forward Execution Time (us) : 87.133
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d7a0>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d7a0>], N: 1000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 873.764
NEW Forward Execution Time (us) : 865.075
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d830>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d830>], N: 2000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 1746.831
NEW Forward Execution Time (us) : 1730.252
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d8c0>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d8c0>], N: 3000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 2619.303
NEW Forward Execution Time (us) : 2598.717
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.063
NEW Forward Execution Time (us) : 7.904
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.275
NEW Forward Execution Time (us) : 8.118
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.896
NEW Forward Execution Time (us) : 7.938
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.745
NEW Forward Execution Time (us) : 7.922
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.575
NEW Forward Execution Time (us) : 13.299
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.090
NEW Forward Execution Time (us) : 8.015
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74129
Approved by: https://github.com/ngimel