[Vulkan] Optimize cat operator for channel dimension (#67207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67207
Improves the performance of the `cat` operator along the channel dimension:
* Adds a faster path, based on `vkCmdCopyImage`, for the case where every input tensor's channel size is a multiple of 4.
* Adds new test cases to cover this scenario.
* Limitation: the shader path and the `vkCmdCopyImage` path cannot be mixed within a single `cat` call. The two paths create the output texture differently, so switching between them would require recreating the output texture each time. `vkCmdCopyImage` is therefore used only if every input tensor's channel size is a multiple of 4.
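The path selection described in the limitation can be sketched as follows. This is a minimal illustration, not the actual implementation: the names `can_use_copy_image`, `choose_cat_path`, and `TEXEL_CHANNELS` are hypothetical, and the sketch assumes that the multiple-of-4 requirement comes from packing four channel values into each texel.

```python
from typing import Sequence

# Assumption: one texel packs 4 channel values, hence the multiple-of-4 rule.
TEXEL_CHANNELS = 4


def can_use_copy_image(channel_sizes: Sequence[int]) -> bool:
    """Return True only if every input's channel size is a multiple of 4."""
    return all(c % TEXEL_CHANNELS == 0 for c in channel_sizes)


def choose_cat_path(channel_sizes: Sequence[int]) -> str:
    """Pick one path for the whole call; the two paths are never mixed."""
    return "copy_image" if can_use_copy_image(channel_sizes) else "shader"


print(choose_cat_path([40, 40, 40]))  # copy_image
print(choose_cat_path([39, 39, 39]))  # shader
print(choose_cat_path([40, 39, 40]))  # shader: one unaligned input disables the fast path
```

Under this rule, the benchmark cases with `C:40`, `C:20`, and `C:4` would take the copy path, while `C:39` and `C:3` would stay on the shader path.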
Test Plan:
**Test Conditions**
* 3 input tensors with size `{3, 40, 221, 193}`
* Number of iterations: `1,000`
* Compare the `Time` column (the `CPU` column reports CPU execution time only)
* Resources are flushed after every iteration because the input tensors are large
* Running `vulkan_perf_test` requires a separate diff (D31906379)
**Test build on Android**
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```
**Test build on Mac**
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```
**Test result on Google Pixel 5**
a) Without `vkCmdCopyImage` for channel sizes that are multiples of 4
```
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark (Without optimization for 4x channels)                            Time           CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       60.4 ms       14.1 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       24.1 ms      0.947 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       59.6 ms       14.0 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        5.98 ms      0.844 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        6.02 ms      0.845 ms         5000
```
b) With `vkCmdCopyImage` for channel sizes that are multiples of 4
```
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark (With optimization for 4x channels)                               Time           CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       39.3 ms       13.3 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       16.4 ms       3.49 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       59.7 ms       14.1 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        3.93 ms      0.855 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        6.14 ms      0.852 ms         5000
```
* `{3,40,221,193}`: 60.4 ms -> 39.3 ms (34.93% less time)
* `{3,20,221,193}`: 24.1 ms -> 16.4 ms (31.95% less time)
* `{3,4,221,193}`: 5.98 ms -> 3.93 ms (34.28% less time)
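The percentages above are relative reductions in the `Time` column, `(before - after) / before`. The short script below recomputes them from the benchmark numbers and also prints the equivalent speedup factor, `before / after`. The helper names are illustrative only.

```python
# (shape, time before in ms, time after in ms), taken from the benchmark tables
results = [
    ("{3,40,221,193}", 60.4, 39.3),
    ("{3,20,221,193}", 24.1, 16.4),
    ("{3,4,221,193}", 5.98, 3.93),
]


def time_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Relative reduction in wall time, as a percentage of the old time."""
    return (before_ms - after_ms) / before_ms * 100.0


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the new run is compared to the old one."""
    return before_ms / after_ms


for shape, before, after in results:
    print(
        f"{shape}: {time_reduction_pct(before, after):.2f}% less time, "
        f"{speedup(before, after):.2f}x speedup"
    )
```

This prints reductions of 34.93%, 31.95%, and 34.28%, corresponding to speedups of roughly 1.54x, 1.47x, and 1.52x.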
Reviewed By: SS-JIA
Differential Revision: D31781390
fbshipit-source-id: 42179d28ae461a9e247053bc9718f6b8c6c819e5