[Vulkan] Partially fix and then disable copying of vulkan quantized tensors to cpu (#90275)
Summary:
Before this diff, copying of vulkan quantized tensors to cpu was broken. The main cause was that the shader only works properly with specific global and local work group sizes, and those sizes had been changed in an earlier refactoring.
As part of this fix, an optimized version of the shader that performs the copying was written, to take advantage of the special case when the plane size (x*y) is a multiple of 4.
After fixing this and writing comprehensive tests, it was discovered that the copying still has issues on Android for specific input sizes, e.g. [1, 1, 11, 17]. These issues are currently unresolved, so copying of quantized vulkan tensors to cpu has been disabled.
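For illustration only, here is a minimal sketch of how a plane-size check could choose between the two shader variants. It assumes sizes are given as [N, C, H, W] and that the plane size is H*W. The function names and the generic shader name `image_to_nchw_quantized` are hypothetical and not taken from this diff; only `image_to_nchw_quantized_mul4` is named here. The real dispatch logic lives in the Vulkan backend and may differ.

```python
def plane_size(sizes):
    """Return H*W for a shape given as [N, C, H, W] (assumed layout)."""
    _, _, height, width = sizes
    return height * width


def pick_copy_shader(sizes):
    """Pick a shader name from the plane size (hypothetical helper)."""
    if plane_size(sizes) % 4 == 0:
        # Special case: the plane size is a multiple of 4.
        return "image_to_nchw_quantized_mul4"
    # Hypothetical name for the general-purpose shader.
    return "image_to_nchw_quantized"


if __name__ == "__main__":
    for shape in ([1, 1, 11, 17], [1, 3, 8, 8]):
        print(shape, plane_size(shape), pick_copy_shader(shape))
```

Under these assumptions, the input size [1, 1, 11, 17] has a plane size of 187, which is not a multiple of 4, so it would not take the optimized path.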
What is contained in this diff?
- Fix for the existing issue (incorrect global and local work group sizes)
- New optimized shader (image_to_nchw_quantized_mul4)
- New comprehensive tests (which have been disabled)
- Disable the copying of quantized vulkan tensors to cpu until issues on Android are fixed.
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Reviewed By: kimishpatel
Differential Revision: D41047098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90275
Approved by: https://github.com/kimishpatel