28570664 - [Vulkan] Add vulkan_perf_test with google benchmark (#67230)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67230

Added a new test `vulkan_perf_test` for measuring performance with google benchmark.

**Summary:**
* `vulkan_perf_test` can be used to run a quick benchmark of Vulkan features on your local machine, comparing performance before and after applying a new method and/or optimizing an existing implementation.
* The **google benchmark** third-party library (https://github.com/google/benchmark) is already in the repo (`fbsource/third-party/benchmark`).
* The number of threads is set to 1 since the Vulkan backend is not thread-safe.
* Added a new API `Context::wait()` to allow benchmark tests to wait for all GPU operations to finish before calling `Context::flush()`.
* Call `Context::wait()` for each output Vulkan tensor and then `Context::flush()` to avoid out-of-memory issues while running a large number of iterations in the benchmark test code (see the first sketch below).
* Use the `Time` column (wall clock) of the benchmark result table as the total execution time for each iteration, rather than the `CPU` column (CPU execution time only).
* More iterations produce more reliable data but take much longer to run; 100-1,000 iterations for bigger tensors and 5,000-10,000 iterations for smaller ones should be sufficient.
* The benchmark data on MacOS is not reliable since there is an extra layer, [MoltenVK](https://github.com/KhronosGroup/MoltenVK), running on top of `Metal`. Also, running Vulkan models on MacOS instead of Metal ones is generally not a good idea.

**Next steps:**
* Add more benchmark tests as we optimize more Vulkan operators.
* Consider using Vulkan's own performance counters, such as [uVkCompute](https://github.com/google/uVkCompute), in the near future. Each iteration time can be set manually with the `benchmark::State::SetIterationTime()` and `Benchmark::UseManualTime()` APIs (see the [UseManualTime API](https://github.com/google/benchmark/blob/365670e4328beb694d0a3adaf40a5974a616bb17/include/benchmark/benchmark.h#L1013) and the second sketch below).
* Consider using this `vulkan_perf_test` as a performance BAT (Build Acceptance Test) on the CI pipeline. `gtest` and `google benchmark` cases can be written in the same place ([see](https://stackoverflow.com/questions/8565666/benchmarking-with-googletest); the third sketch below shows one possible shared `main()`), and [swiftshader](https://github.com/google/swiftshader) can be used for Sandcastle devservers that don't support Vulkan. We may come up with reasonable performance criteria for each test so that it fails on any significant performance degradation.
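For reference, a minimal sketch of what one benchmark case in `vulkan_perf_test` could look like, following the points above (single thread, fixed iteration count, `Context::wait()` on the output Vulkan tensor followed by `Context::flush()`). The header paths and the exact `wait()` signature here are assumptions based on the summary bullets, not the committed code.

```cpp
#include <benchmark/benchmark.h>

#include <ATen/ATen.h>
#include <ATen/native/vulkan/api/api.h>  // assumed umbrella header for the Vulkan api::Context

static void cat_op_channel_perf(benchmark::State& state) {
  // Benchmark arguments: N, C, H, W (shown as "N:3/C:40/..." in the result table).
  const int64_t n = state.range(0);
  const int64_t c = state.range(1);
  const int64_t h = state.range(2);
  const int64_t w = state.range(3);

  // CPU inputs created once; each iteration uploads them to Vulkan and
  // concatenates along the channel dimension.
  const auto in_cpu1 = at::rand({n, c, h, w}, at::device(at::kCPU).dtype(at::kFloat));
  const auto in_cpu2 = at::rand({n, c, h, w}, at::device(at::kCPU).dtype(at::kFloat));

  for (auto _ : state) {
    const auto out_vulkan = at::cat({in_cpu1.vulkan(), in_cpu2.vulkan()}, /*dim=*/1);

    // Wait for all GPU work on the output tensor, then flush to release
    // resources so that thousands of iterations don't run out of memory.
    // (The wait() signature is an assumption based on the summary above.)
    at::native::vulkan::api::context()->wait(out_vulkan);
    at::native::vulkan::api::context()->flush();
  }
}

// One registration per tensor shape; a single thread because the Vulkan
// backend is not thread-safe, and wall-clock ("Time") reported in milliseconds.
BENCHMARK(cat_op_channel_perf)
    ->ArgNames({"N", "C", "H", "W"})
    ->Unit(benchmark::kMillisecond)
    ->Threads(1)
    ->Iterations(1000)
    ->Args({3, 40, 221, 193});

BENCHMARK_MAIN();
```

The `Time` column in the result tables below then corresponds to the wall-clock time of one such iteration.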
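A sketch of the manual-timing variant mentioned in the next steps: the benchmark body measures its own interval (here with `std::chrono` as a stand-in for a GPU-side counter such as one from uVkCompute) and reports it via `benchmark::State::SetIterationTime()`, with the case registered using `UseManualTime()`. The `run_vulkan_op()` helper is hypothetical.

```cpp
#include <chrono>

#include <benchmark/benchmark.h>

// Hypothetical stand-in for the Vulkan operator under test (submit + wait),
// e.g. the cat + Context::wait() pattern from the first sketch.
static void run_vulkan_op() {}

static void cat_op_channel_perf_manual(benchmark::State& state) {
  for (auto _ : state) {
    const auto start = std::chrono::high_resolution_clock::now();

    run_vulkan_op();

    const auto end = std::chrono::high_resolution_clock::now();
    const std::chrono::duration<double> elapsed = end - start;

    // With UseManualTime() below, google benchmark uses this reported value
    // (in seconds) instead of its own wall-clock measurement. A GPU timestamp
    // could be reported the same way.
    state.SetIterationTime(elapsed.count());
  }
}

// Relies on the same BENCHMARK_MAIN() entry point as the first sketch.
BENCHMARK(cat_op_channel_perf_manual)
    ->Unit(benchmark::kMillisecond)
    ->Iterations(1000)
    ->UseManualTime();
```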
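And one possible shape for the combined `gtest` + `google benchmark` binary mentioned above: a sketch of a shared `main()` driving both frameworks, not the current test target's entry point.

```cpp
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

// Correctness tests (gtest) and performance cases (google benchmark) are
// registered elsewhere in the same binary; this main() runs both.
int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  ::benchmark::Initialize(&argc, argv);

  const int gtest_status = RUN_ALL_TESTS();

  ::benchmark::RunSpecifiedBenchmarks();

  return gtest_status;
}
```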
Test Plan:

**Test build on Android**
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```

**Test build on MacOS**
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```

**Test result on Google Pixel 5**
```
Running /data/local/tmp/vulkan_perf_test
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark (Without optimization for 4x channels)                          Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1     60.4 ms         14.1 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1     24.1 ms        0.947 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1     59.6 ms         14.0 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1      5.98 ms        0.844 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1      6.02 ms        0.845 ms         5000
-------------------------------------------------------------------------------------------------------------
Benchmark (With optimization for 4x channels)                             Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1     39.3 ms         13.3 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1     16.4 ms         3.49 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1     59.7 ms         14.1 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1      3.93 ms        0.855 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1      6.14 ms        0.852 ms         5000
```

Note that even the smaller tensors receive a significant improvement on the Android build (`3.93 ms` vs `6.14 ms` when comparing `{3,4,221,193}` with `{3,3,221,193}`), because the `vkCmdCopyImage` API is used for the slightly bigger tensor `{3,4,221,193}` (channel count is a multiple of 4) instead of shader operations.
* `{3,40,221,193}`: 60.4 ms -> 39.3 ms (34.93% faster)
* `{3,20,221,193}`: 24.1 ms -> 16.4 ms (31.95% faster)
* `{3,4,221,193}`: 5.98 ms -> 3.93 ms (34.28% faster)

{F674052834}
**Test result on MacOS**
```
Running ./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac#macosx-x86_64
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 5.95, 5.02, 5.15
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------------------
Benchmark (Without optimization for 4x channels)                          Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1     51.2 ms         35.5 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1     11.4 ms         4.76 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1     51.9 ms         35.0 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1      2.84 ms         1.36 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1      2.30 ms         1.13 ms         5000
-------------------------------------------------------------------------------------------------------------
Benchmark (With optimization for 4x channels)                             Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1     70.1 ms         36.9 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1     11.8 ms         5.00 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1     69.3 ms         36.8 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1      4.60 ms         1.48 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1      3.65 ms         1.41 ms         5000
```

Note that the `{3,40,221,193}` input tensors don't receive any performance improvement on MacOS when the `vkCmdCopyImage` API is used to directly copy textures where the number of channels is a multiple of 4. This may be due to the extra [MoltenVK](https://github.com/KhronosGroup/MoltenVK) layer that runs on top of `Metal`.

Reviewed By: SS-JIA

Differential Revision: D31906379

fbshipit-source-id: 0addc766502dba1a915b08840b3a4dc786a9cd9d