[PyTorch][Vulkan] Add a matrix multiplication performance test binary and fix GPU latency measurement (#108266)
Summary:
- Added a new matrix multiplication (matmul) performance test binary, built as target `pt_vulkan_mm_perf_test_bin`
- Renamed the existing `vulkan_perf_test_bin` to `vulkan_conv_arithmetic_perf_test_bin` and renamed its source file to match
- **Fixed the manual-time benchmark measurement in both performance binaries, which was looking up the wrong op names (e.g. it queried the runtime of the nonexistent "mm" instead of "vulkan.mm")**
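
To illustrate the op-name fix, here is a minimal Python sketch of an exact-name lookup over per-kernel GPU durations. It is not the PyTorch implementation: the `ShaderDuration` record and the `total_op_ns` helper are hypothetical names, and the rows are copied from the matmul output excerpt in the test plan (elided rows omitted). It only shows why querying "mm" contributes no GPU time while "vulkan.mm" does.

```python
from typing import NamedTuple


class ShaderDuration(NamedTuple):
    # Hypothetical record type: one row of the GPU diagnostics report.
    kernel_name: str   # name as reported, e.g. "vulkan.mm"
    duration_ns: int   # GPU execution time in nanoseconds


def total_op_ns(entries, op_name):
    """Sum the durations of all entries whose kernel name equals op_name exactly."""
    return sum(e.duration_ns for e in entries if e.kernel_name == op_name)


# Rows taken from the matmul output excerpt in the test plan.
entries = [
    ShaderDuration("vulkan.nchw_to_image", 4336072),
    ShaderDuration("vulkan.nchw_to_image", 1106716),
    ShaderDuration("vulkan.nchw_to_image", 7228),
    ShaderDuration("vulkan.mm", 132570256),
    ShaderDuration("vulkan.mm", 80492152),
    ShaderDuration("vulkan.image_to_nchw", 1420328),
]

wrong = total_op_ns(entries, "mm")         # no kernel is named "mm"
right = total_op_ns(entries, "vulkan.mm")  # matches the reported kernel name
print(wrong, right)
```

With an exact-match lookup, a name that never appears in the report silently sums to zero, so the manual time fed to the benchmark would not reflect any GPU work.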
Test Plan:
# pt_vulkan_mm_perf_test_bin
- build the matrix multiplication performance test binary
```
~/fbsource » buck2 build -c ndk.debug_info_level=0 -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid --show-output -c pt.vulkan_full_precision=1
[...]
BUILD SUCCEEDED
fbsource//xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_mm_perf_test_binAndroid__/pt_vulkan_mm_perf_test_binAndroid
```
- test on an arm32 Android device
```
~/fbsource » adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_mm_perf_test_binAndroid__/pt_vulkan_mm_perf_test_binAndroid /data/local/tmp/
~/fbsource » adb shell /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
```
- output P817269023, excerpt below
```
Kernel Name           Workgroup Size  Duration (ns)
===========           ==============  ===========
vulkan.nchw_to_image  {500, 500, 1}   4336072
vulkan.nchw_to_image  {250, 250, 1}   1106716
vulkan.nchw_to_image  {1, 1, 1}       7228
vulkan.mm             {250, 250, 1}   132570256
[...]
vulkan.mm             {250, 250, 1}   80492152
vulkan.image_to_nchw  {500, 500, 1}   1420328
-------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------------------------------------
mm_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1 91047 ms 143 ms 5
```
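
For reading the kernel report, the following Python sketch parses rows of the form `name {x, y, z} duration` and prints each duration in milliseconds. It assumes the layout shown in the excerpt above; the `ROW` pattern and the `parse_rows` helper are hypothetical and are not part of the test binaries.

```python
import re

# Assumed row layout: <kernel name> {<x>, <y>, <z>} <duration in ns>
ROW = re.compile(r"^(?P<name>\S+)\s+\{(?P<wg>[^}]*)\}\s+(?P<ns>\d+)$")


def parse_rows(text):
    """Return (kernel_name, workgroup_size, duration_ns) tuples from a report."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # skip headers, separators, and "[...]" elision markers
        workgroup = tuple(int(x) for x in m.group("wg").split(","))
        rows.append((m.group("name"), workgroup, int(m.group("ns"))))
    return rows


# A few rows copied from the excerpt; elided rows are not reconstructed.
excerpt = """\
Kernel Name Workgroup Size Duration (ns)
=========== ============== ===========
vulkan.nchw_to_image {500, 500, 1} 4336072
vulkan.mm {250, 250, 1} 132570256
[...]
vulkan.mm {250, 250, 1} 80492152
vulkan.image_to_nchw {500, 500, 1} 1420328
"""

rows = parse_rows(excerpt)
for name, workgroup, ns in rows:
    # 1 ms = 1e6 ns
    print(f"{name:<22} {workgroup} {ns / 1e6:.3f} ms")
```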
# pt_vulkan_conv_arithmetic_perf_test_bin
- build the convolution and arithmetic performance test binary
```
~/fbsource » buck2 build -c ndk.debug_info_level=0 -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_conv_arithmetic_perf_test_binAndroid --show-output -c pt.vulkan_full_precision=1
[...]
BUILD SUCCEEDED
fbsource//xplat/caffe2:pt_vulkan_conv_arithmetic_perf_test_binAndroid buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_conv_arithmetic_perf_test_binAndroid__/pt_vulkan_conv_arithmetic_perf_test_binAndroid
```
- test on an arm32 Android device
```
~/fbsource » adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_conv_arithmetic_perf_test_binAndroid__/pt_vulkan_conv_arithmetic_perf_test_binAndroid /data/local/tmp/
~/fbsource » adb shell /data/local/tmp/pt_vulkan_conv_arithmetic_perf_test_binAndroid
2023-07-20T20:23:26+00:00
```
- output P817267332, excerpt below
```
Kernel Name           Workgroup Size  Duration (ns)
===========           ==============  ===========
vulkan.add            {193, 221, 30}  39475696
vulkan.image_to_nchw  {193, 221, 30}  13463424
vulkan.add            {193, 221, 30}  72950176
vulkan.image_to_nchw  {193, 221, 30}  17792684
[...]
vulkan.add            {193, 221, 30}  72986368
vulkan.image_to_nchw  {193, 221, 30}  15921672
----------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------------------------------------------
add_op_benchmark/N:3/C:40/H:221/W:193/iterations:100/manual_time/threads:1 73242 ms 602 ms 100
libc++abi: terminating due to uncaught exception of type c10::Error: Copy of vulkan quantized tensors to cpu is currently disabled!
```
Reviewed By: yipjustin
Differential Revision: D48798710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108266
Approved by: https://github.com/manuelcandales