1cade067 - [Vulkan] Vulkan backend is now thread-safe (#67733)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67733

The Vulkan backend is now thread-safe:
* The `ThreadContext` class holds all per-thread Vulkan state, such as the Command, Descriptor and Resource objects.
* `ThreadContext::SingletonThreadLocalObject<T>` is a very light version of `folly::SingletonThreadLocal` (https://github.com/facebook/folly/blob/main/folly/SingletonThreadLocal.h). It holds a static object with the `thread_local` modifier and is tied to a `GPU` object, which allows us to expand to a multi-threaded GPU backend and multi-GPU capability in the future. The lifetime of a `SingletonThreadLocalObject<T>` object runs from the first call (instantiation) to the termination of the thread (see the sketch after this list).
* The `MAKE_VULKAN_THREADSAFE` preprocessor macro gates the BUCK build and the implementation of the thread-safe Vulkan backend. We can quickly exclude it from the BUCK build if any unexpected issue gets uncovered in the future; once we are confident it is stable, we can remove the macro from the code.
* A new perf test with input size `{3,40,221,193}` and 3 threads is added.
* `vkQueueSubmit` is not thread-safe; only one thread can push commands at a time (see https://vkguide.dev/docs/chapter-1/vulkan_command_flow/#vulkan-command-execution). The number of available queues depends on the GPU; it could be 1, so we cannot assume we can create multiple queues. Thus, we need to avoid calling `vkQueueSubmit` from multiple threads at the same time. When the Vulkan backend runs on different threads without any locking mechanism, `vkQueueSubmit` fails with `VK_ERROR_INITIALIZATION_FAILED(-3)`.
* In `Context::~Context()`, we should not call `flush()`, since all per-thread objects will be destroyed as each thread exits. The logs further below verify that all per-thread objects are destroyed as their threads terminate.
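A minimal sketch of the `thread_local` singleton idea follows (this is not the actual PyTorch implementation; the real class is tied to a `GPU` object and manages the Vulkan pools, and `PerThreadState` below is purely hypothetical):

```
// Minimal sketch, assuming a default-constructible T; not the actual
// SingletonThreadLocalObject from this diff.
template <typename T>
struct SingletonThreadLocalObject final {
  // One instance per (T, thread): lazily constructed on the first call from a
  // given thread and destroyed automatically when that thread terminates,
  // matching the lifetime described above.
  static T& get() {
    static thread_local T object;
    return object;
  }

  SingletonThreadLocalObject() = delete;
};

// Hypothetical usage: each thread that calls this gets its own state object.
struct PerThreadState {
  // e.g. per-thread command, descriptor and resource pools
};

inline PerThreadState& per_thread_state() {
  return SingletonThreadLocalObject<PerThreadState>::get();
}
```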
The logs below capture all ctor/dtor calls when running the Vulkan backend with 3 different threads:
```
ThreadContext::ThreadContext() -> thread[0x1207d5e00] this[0x0x7f9489981e28]
Context::Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1]
Resource::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00]
Command::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003]
Resource::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00]
Command::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008]
Resource::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00]
Command::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d]
Descriptor::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d]
Descriptor::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e]
Descriptor::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f]
Descriptor::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f] -> enter
Descriptor::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f] -> leave
Command::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003] -> enter
Command::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003] -> leave
Resource::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00] -> enter
Descriptor::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e] -> enter
Descriptor::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e] -> leave
Command::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008] -> enter
Command::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008] -> leave
Resource::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00] -> enter
Resource::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00] -> leave
Resource::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00] -> leave
Descriptor::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d] -> enter
Descriptor::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d] -> leave
Command::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d] -> enter
Command::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d] -> leave
Resource::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00] -> enter
Resource::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00] -> leave
Context::~Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1] -> enter
Context::~Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1] -> leave
ThreadContext::~ThreadContext() -> thread[0x1207d5e00] this[0x0x7f9489981e28] -> enter
ThreadContext::~ThreadContext() -> thread[0x1207d5e00] this[0x0x7f9489981e28] -> leave
```

Some notes on unexpected behavior around `VkQueue`:
* When running multi-threaded, we need to make sure only one thread accesses the `VkQueue` at a time, or protect the `VkQueue` with a locking mechanism from multiple threads; the locking approach is the one used in this change (see the sketch after these notes).
* To avoid lock overhead, we tried using a per-thread `VkQueue` (a separate queue object per thread), but that did not fix the `VK_ERROR_INITIALIZATION_FAILED` error from the `vkQueueSubmit` call, which was unexpected. Interestingly, MacOS does not crash with this per-thread approach, but its behavior has not been that reliable, so it is unclear whether this is an Android Vulkan driver issue or not.
* Making the entire `Context` `thread_local` without any lock actually fixes the same error.
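A minimal sketch of the locking approach around queue submission (the `QueueGuard` wrapper and its `queue_mutex_` member are illustrative assumptions, not the actual PyTorch code):

```
#include <mutex>

#include <vulkan/vulkan.h>

// Minimal sketch: serialize all submissions to a single VkQueue with a mutex.
class QueueGuard final {
 public:
  explicit QueueGuard(const VkQueue queue) : queue_(queue) {}

  VkResult submit(
      const uint32_t submit_count,
      const VkSubmitInfo* const submits,
      const VkFence fence) {
    // vkQueueSubmit requires external synchronization on the queue: only one
    // thread may submit to a given VkQueue at a time, so hold the lock for
    // the duration of the call.
    std::lock_guard<std::mutex> guard(queue_mutex_);
    return vkQueueSubmit(queue_, submit_count, submits, fence);
  }

 private:
  VkQueue queue_;
  std::mutex queue_mutex_;
};
```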
Test Plan:

**Test build on Android**
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```

**Test build on MacOS**
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```

**Test result on Google Pixel 5**
```
//xplat/caffe2:pt_vulkan_perf_test_binAndroid#android-arm64 buck-out/gen/fe3a39b8/xplat/caffe2/pt_vulkan_perf_test_binAndroid#android-arm64
buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid#android-arm64: 1 file pushed, 0 skipped. 145.4 MB/s (826929592 bytes in 5.426s)
Running /data/local/tmp/vulkan_perf_test
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

=============================================================================================================
Thread-safe Vulkan backend on Google Pixel 5
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1     55.8 ms         15.1 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1     25.6 ms         4.08 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1     60.6 ms         14.3 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1      4.52 ms        0.757 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1      7.16 ms        0.770 ms         5000
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:3     35.9 ms         38.8 ms         3000

=============================================================================================================
Non thread-safe Vulkan backend on Google Pixel 5
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1     55.0 ms         14.5 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1     25.8 ms         4.30 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1     60.6 ms         14.5 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1      4.52 ms        0.761 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1      7.15 ms        0.765 ms         5000
```

In the single-thread scenario, the difference between the thread-safe and non-thread-safe versions is less than 2%, which is acceptable. In other words, there is no considerable performance degradation in the thread-safe Vulkan backend from using:
* singleton thread-local objects for the `Command`, `Descriptor` and `Resource` pools
* a mutex lock around the `vkQueueSubmit` call
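For reference, the `threads:3` row above runs the same benchmark body concurrently on three threads. Below is a minimal sketch of how such a case can be registered with Google Benchmark, using a stand-in body (the real `cat_op_channel_perf` lives in the `pt_vulkan_perf_test` target and runs the Vulkan `cat` op):

```
#include <benchmark/benchmark.h>

// Stand-in benchmark body; the real one builds {N, C, H, W} Vulkan tensors
// and runs the cat op on them.
static void cat_op_channel_perf(benchmark::State& state) {
  for (auto _ : state) {
    // run the op under test here
  }
}

// Single-threaded baseline with the {3, 40, 221, 193} input.
BENCHMARK(cat_op_channel_perf)
    ->ArgNames({"N", "C", "H", "W"})
    ->Args({3, 40, 221, 193})
    ->Iterations(1000)
    ->Threads(1);

// Same input on 3 threads: each thread runs the body against the shared,
// thread-safe backend (3 x 1000 = 3000 iterations reported above).
BENCHMARK(cat_op_channel_perf)
    ->ArgNames({"N", "C", "H", "W"})
    ->Args({3, 40, 221, 193})
    ->Iterations(1000)
    ->Threads(3);

BENCHMARK_MAIN();
```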
**Test result on MacOS**
```
Running ./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac#macosx-x86_64
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 11.96, 7.17, 5.45
***WARNING*** Library was built as DEBUG. Timings may be affected.

=============================================================================================================
Thread-safe Vulkan backend on MacOS
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1     58.4 ms         42.8 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1     12.3 ms         5.43 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1     56.0 ms         41.2 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1      3.00 ms         1.52 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1      2.56 ms         1.34 ms         5000
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:3     42.8 ms         42.8 ms         3000

=============================================================================================================
Non thread-safe Vulkan backend on MacOS
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1     58.6 ms         42.6 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1     11.3 ms         4.67 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1     57.6 ms         42.4 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1      2.89 ms         1.45 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1      2.47 ms         1.27 ms         5000
```

The non-thread-safe version is slightly faster than the thread-safe one. This test result is for reference only, since MacOS runs Vulkan through an extra layer, [MoltenVK](https://github.com/KhronosGroup/MoltenVK), on top of `Metal`, and has not been reliable.

Reviewed By: SS-JIA

Differential Revision: D32093974

fbshipit-source-id: 9eab7f0db976eff717540a5b32f94ed17a00b662