[vulkan] Adaptive local work group size (#61170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61170
Instead of using a fixed local work group size of {4,4,4}, choose the local work group size based on the global work group size in order to minimize the number of inactive invocations.
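
A minimal sketch of the idea, not the code in this PR: pick the local size that wastes the fewest invocations from a small set of candidate shapes. The `Extent3` type, the function names, and the candidate list are all hypothetical and chosen for illustration. Every candidate keeps 64 invocations per work group, matching the old {4,4,4}.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 3D extent type; not the type used in the PyTorch Vulkan backend.
struct Extent3 {
  uint32_t x, y, z;
};

// Round `value` up to the next multiple of `multiple`.
inline uint64_t round_up(uint64_t value, uint64_t multiple) {
  assert(multiple != 0);
  return (value + multiple - 1) / multiple * multiple;
}

// Invocations that are launched but fall outside the global extent.
inline uint64_t inactive_invocations(const Extent3& global, const Extent3& local) {
  const uint64_t launched = round_up(global.x, local.x) *
                            round_up(global.y, local.y) *
                            round_up(global.z, local.z);
  const uint64_t useful = uint64_t(global.x) * global.y * global.z;
  return launched - useful;
}

// Return the candidate local size with the fewest inactive invocations.
inline Extent3 adaptive_local_size(const Extent3& global) {
  // Assumed candidate shapes (illustrative only), each with 64 invocations.
  static const Extent3 candidates[] = {
      {4, 4, 4}, {8, 8, 1}, {16, 4, 1}, {4, 16, 1},
      {64, 1, 1}, {1, 64, 1}, {1, 1, 64},
  };
  Extent3 best = candidates[0];
  uint64_t best_waste = inactive_invocations(global, best);
  for (const Extent3& candidate : candidates) {
    const uint64_t waste = inactive_invocations(global, candidate);
    if (waste < best_waste) {  // ties keep the earlier candidate
      best = candidate;
      best_waste = waste;
    }
  }
  return best;
}
```

For example, with a global size of {7,7,1}, a {4,4,4} local size launches 8*8*4 = 256 invocations for 49 useful ones, while {8,8,1} launches only 64.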
## Perf improvements from this change
On aloha portal devices, combined with the diff below, which tweaks the conv2d_pw shader to calculate a 4x4 output, this change reduced the benchmark latency of the xirp14b model from ~8.7 ms to ~6.6 ms.
Test Plan:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```
Reviewed By: IvanKobzarev
Differential Revision: D28724591
fbshipit-source-id: ede896300b2be1a9578e492cb870121012886aa7