pytorch
cdc9b262 - [Vulkan] Optimize cat operator for channel dimension (#67207)

Commit
3 years ago
[Vulkan] Optimize cat operator for channel dimension (#67207)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67207

Improved performance of the `cat` operator along the channel dimension:
* Performance improves when the input tensors' channel sizes are multiples of 4.
* Added new test cases to cover this scenario.
* Limitation: We can't mix the shader path and `vkCmdCopyImage` in the same operation. The two paths create the output texture differently, so we can't switch between them unless we create the output texture every time. We consider using `vkCmdCopyImage` only if every input tensor's channel size is a multiple of 4.

{F673815905}

Test Plan:
**Test Conditions**
* 3 input tensors with size `{3, 40, 221, 193}`
* Number of iterations: `1,000`
* Compare the `Time` column (the `CPU` column is only for CPU execution time)
* Flush resources after every iteration because the input tensors are large
* Running `vulkan_perf_test` requires a separate diff (D31906379)

**Test build on Android**
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```

**Test build on Mac**
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```

**Test result on Google Pixel 5**

a) Without using `vkCmdCopyImage` for multiples of 4 in channel dimension
```
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark (Without optimization for 4x channels)                             Time           CPU    Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1        60.4 ms       14.1 ms          1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1        24.1 ms      0.947 ms          1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1        59.6 ms       14.0 ms          1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1         5.98 ms      0.844 ms          5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1         6.02 ms      0.845 ms          5000
```

b) With using `vkCmdCopyImage` for multiples of 4 in channel dimension
```
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark (With optimization for 4x channels)                                Time           CPU    Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1        39.3 ms       13.3 ms          1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1        16.4 ms       3.49 ms          1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1        59.7 ms       14.1 ms          1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1         3.93 ms      0.855 ms          5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1         6.14 ms      0.852 ms          5000
```

* `{3,40,221,193}`: 60.4 ms -> 39.3 ms (34.93% faster)
* `{3,20,221,193}`: 24.1 ms -> 16.4 ms (31.95% faster)
* `{3,4,221,193}`: 5.98 ms -> 3.93 ms (34.28% faster)

{F674052795}

Reviewed By: SS-JIA

Differential Revision: D31781390

fbshipit-source-id: 42179d28ae461a9e247053bc9718f6b8c6c819e5
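
The limitation in the summary amounts to an all-or-nothing choice per `cat` call: the copy path is used only when every input qualifies. A minimal C++ sketch of that decision, together with the arithmetic behind the reported percentages, might look like the following. All names here (`CatPath`, `choose_cat_path`, `percent_time_reduction`) are hypothetical and are not taken from the PyTorch Vulkan source, and the four-channels-per-texel constant is an assumption about why multiples of 4 matter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; this is an illustration, not the PyTorch source.
enum class CatPath { ImageCopy, Shader };

// Assumption: channels are packed four to a texel, so a channel count that is
// a multiple of 4 occupies whole texels and can be copied as an image region.
constexpr std::int64_t kChannelsPerTexel = 4;

// Chooses one path for the whole cat call. A single input whose channel count
// is not a multiple of 4 sends every input through the shader path, because
// the two paths cannot be mixed on one output texture.
CatPath choose_cat_path(const std::vector<std::int64_t>& channel_sizes) {
  for (const std::int64_t c : channel_sizes) {
    if (c % kChannelsPerTexel != 0) {
      return CatPath::Shader;
    }
  }
  return CatPath::ImageCopy;
}

// Relative reduction in wall time, in percent, as used in the result summary.
double percent_time_reduction(double before_ms, double after_ms) {
  assert(before_ms > 0.0);
  return (before_ms - after_ms) / before_ms * 100.0;
}
```

Under this sketch, channel sizes 40, 20 and 4 would select the copy path, while 39 and 3 would fall back to the shader, which is consistent with the benchmark rows whose `Time` values barely changed. `percent_time_reduction(60.4, 39.3)` evaluates to about 34.93, the figure quoted for `{3,40,221,193}`.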