[Vulkan] Vulkan backend is now thread-safe (#67733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67733
Vulkan backend is now thread-safe:
* The `ThreadContext` class holds all per-thread Vulkan state, such as the Command, Descriptor, and Resource objects.
* `ThreadContext::SingletonThreadLocalObject<T>` is a very light version of `folly::SingletonThreadLocal` (https://github.com/facebook/folly/blob/main/folly/SingletonThreadLocal.h). It holds a static object with the `thread_local` modifier and is tied to a `GPU` object, which allows us to extend to a multi-threaded GPU backend and multi-GPU capability in the future. A `SingletonThreadLocalObject<T>` object lives from the first call (instantiation) to the termination of its thread (see the sketch after the logs below).
* The `MAKE_VULKAN_THREADSAFE` preprocessor flag gates the thread-safe implementation and is set from BUCK. We can quickly disable it in BUCK if any unexpected issue is uncovered in the future; once we are confident the change is stable, we can remove the flag from the code.
* A new perf test is added with input size `{3,40,221,193}` running on 3 threads.
* `vkQueueSubmit` is not thread-safe; only one thread can submit commands to a queue at a time (see https://vkguide.dev/docs/chapter-1/vulkan_command_flow/#vulkan-command-execution). The number of available queues depends on the GPU; it could be 1, so we cannot assume we can create multiple queues. Thus, we must avoid calling `vkQueueSubmit` from multiple threads at the same time. When the Vulkan backend runs on multiple threads without any locking mechanism, `vkQueueSubmit` fails with `VK_ERROR_INITIALIZATION_FAILED(-3)`.
* `Context::~Context()` should not call `flush()`, since all per-thread objects are destroyed as each thread exits. The following logs, which capture all ctor/dtor calls while running the Vulkan backend on 3 different threads, verify that every per-thread object is destroyed as its thread terminates:
```
ThreadContext::ThreadContext() -> thread[0x1207d5e00] this[0x7f9489981e28]
Context::Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1]
Resource::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00]
Command::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003]
Resource::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00]
Command::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008]
Resource::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00]
Command::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d]
Descriptor::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d]
Descriptor::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e]
Descriptor::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f]
Descriptor::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f] -> enter
Descriptor::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f] -> leave
Command::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003] -> enter
Command::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003] -> leave
Resource::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00] -> enter
Descriptor::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e] -> enter
Descriptor::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e] -> leave
Command::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008] -> enter
Command::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008] -> leave
Resource::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00] -> enter
Resource::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00] -> leave
Resource::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00] -> leave
Descriptor::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d] -> enter
Descriptor::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d] -> leave
Command::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d] -> enter
Command::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d] -> leave
Resource::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00] -> enter
Resource::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00] -> leave
Context::~Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1] -> enter
Context::~Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1] -> leave
ThreadContext::~ThreadContext() -> thread[0x1207d5e00] this[0x7f9489981e28] -> enter
ThreadContext::~ThreadContext() -> thread[0x1207d5e00] this[0x7f9489981e28] -> leave
```
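For illustration, here is a minimal sketch of the `SingletonThreadLocalObject<T>` pattern described above (the `GPU` struct and the names in it are hypothetical simplifications, not the exact code in this diff):
```
#include <cstdint>

// Hypothetical stand-in for the GPU descriptor the real class is tied to.
struct GPU {
  int32_t device_id;
};

// Minimal sketch: a lazily constructed, per-thread instance of T whose
// lifetime runs from the first get() on a thread to that thread's exit.
template <typename T>
class SingletonThreadLocalObject final {
 public:
  SingletonThreadLocalObject() = delete;

  static T& get(const GPU& gpu) {
    // One instance per T per thread; the first call on each thread
    // constructs it, and its destructor runs at thread termination.
    static thread_local T object(gpu);
    return object;
  }
};
```
With this pattern, something like `SingletonThreadLocalObject<Command::Pool>::get(gpu)` lazily creates one pool per thread and tears it down at thread exit, matching the per-thread dtor ordering in the logs above.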
Some notes on unexpected behaviors of `VkQueue`:
* If multi-threaded, we must ensure that only one thread accesses the `VkQueue` at a time, or protect the `VkQueue` from multiple threads with a locking mechanism. The locking approach is used in this change (see the sketch after this list).
* To avoid lock overhead, we tried a per-thread `VkQueue` (a separate queue object per thread), but unexpectedly it did not fix the `VK_ERROR_INITIALIZATION_FAILED` error from the `vkQueueSubmit` call. Interestingly, macOS does not crash with the per-thread approach, but that is no surprise given that its behavior has not been reliable. It is unclear whether this is an Android Vulkan driver issue.
* Making the entire `Context` `thread_local`, without any lock, does fix the same error.
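The locking approach from the first note can be sketched as follows (illustrative only; `queue_mutex` and `locked_queue_submit` are hypothetical names, not the exact code in this diff):
```
#include <mutex>
#include <vulkan/vulkan.h>

// One process-wide mutex guarding the single VkQueue shared by all threads.
static std::mutex queue_mutex;

VkResult locked_queue_submit(
    VkQueue queue,
    uint32_t submit_count,
    const VkSubmitInfo* submits,
    VkFence fence) {
  // Serialize submission: vkQueueSubmit is not thread-safe for the same
  // queue, and the device may expose only one queue.
  std::lock_guard<std::mutex> guard(queue_mutex);
  return vkQueueSubmit(queue, submit_count, submits, fence);
}
```
Holding the lock only around the submit keeps the critical section small, so per-thread command recording can still proceed in parallel.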
Test Plan:
**Test build on Android**
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```
**Test build on macOS**
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```
**Test result on Google Pixel 5**
```
//xplat/caffe2:pt_vulkan_perf_test_binAndroid#android-arm64 buck-out/gen/fe3a39b8/xplat/caffe2/pt_vulkan_perf_test_binAndroid#android-arm64
buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid#android-arm64: 1 file pushed, 0 skipped. 145.4 MB/s (826929592 bytes in 5.426s)
Running /data/local/tmp/vulkan_perf_test
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
=============================================================================================================
Thread-safe Vulkan backend on Google Pixel 5
-------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1 55.8 ms 15.1 ms 1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1 25.6 ms 4.08 ms 1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1 60.6 ms 14.3 ms 1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1 4.52 ms 0.757 ms 5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1 7.16 ms 0.770 ms 5000
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:3 35.9 ms 38.8 ms 3000
=============================================================================================================
Non thread-safe Vulkan backend on Google Pixel 5
-------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1 55.0 ms 14.5 ms 1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1 25.8 ms 4.30 ms 1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1 60.6 ms 14.5 ms 1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1 4.52 ms 0.761 ms 5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1 7.15 ms 0.765 ms 5000
```
For the single-thread scenario, the difference between the thread-safe and non-thread-safe versions is less than 2%, which is acceptable. In other words, there is no considerable performance degradation from making the Vulkan backend thread-safe by using:
* singleton thread-local objects for the `Command`, `Descriptor`, and `Resource` pools
* a mutex lock around the `vkQueueSubmit` call
**Test result on macOS**
```
Running ./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac#macosx-x86_64
Run on (16 X 2400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 11.96, 7.17, 5.45
***WARNING*** Library was built as DEBUG. Timings may be affected.
=============================================================================================================
Thread-safe Vulkan backend on MacOS
-------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1 58.4 ms 42.8 ms 1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1 12.3 ms 5.43 ms 1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1 56.0 ms 41.2 ms 1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1 3.00 ms 1.52 ms 5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1 2.56 ms 1.34 ms 5000
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:3 42.8 ms 42.8 ms 3000
=============================================================================================================
Non thread-safe Vulkan backend on MacOS
-------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1 58.6 ms 42.6 ms 1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1 11.3 ms 4.67 ms 1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1 57.6 ms 42.4 ms 1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1 2.89 ms 1.45 ms 5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1 2.47 ms 1.27 ms 5000
```
The non-thread-safe version is slightly faster than the thread-safe one. This result is for reference only, since macOS cannot be fully trusted here: it runs Vulkan through an extra layer, [MoltenVK](https://github.com/KhronosGroup/MoltenVK), on top of `Metal`.
Reviewed By: SS-JIA
Differential Revision: D32093974
fbshipit-source-id: 9eab7f0db976eff717540a5b32f94ed17a00b662