[vulkan] Use busy polling when waiting for VkFence (#81470)
When waiting for a `VkFence`, busy poll instead of calling `vkWaitForFences`.
It appears that `vkWaitForFences` is implemented in such a way that the calling thread is put to sleep until the fence is signaled. Sleeping causes a reduction in CPU frequency, which makes subsequent CPU operations (such as the memcpy back to a CPU tensor) take much longer than they otherwise would. Busy waiting keeps the CPU hot and therefore avoids this problem.
Making this change drastically reduces benchmark latency. Before this change, the xirp20a model on Pixel 3 ran at an average latency of 16 ms per iteration. After switching to busy polling, benchmark latency drops to 10 ms per iteration, a reduction of roughly 38%. The improvement is presumably due to the CPU staying at a steady clock frequency rather than constantly switching between high and low frequencies.
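For illustration, the polling loop described above can be sketched as follows. This is a minimal sketch, not the code from this change: `FenceStatus`, `busy_wait`, and `FakeFence` are hypothetical names, and the status callback stands in for a call to `vkGetFenceStatus(device, fence)`, which returns `VK_SUCCESS` once the fence is signaled, `VK_NOT_READY` while it is pending, and `VK_ERROR_DEVICE_LOST` on device loss. Using a stand-in keeps the sketch compilable without the Vulkan headers or a device.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the results vkGetFenceStatus can return:
// VK_SUCCESS, VK_NOT_READY, and VK_ERROR_DEVICE_LOST.
enum class FenceStatus { Signaled, NotReady, DeviceLost };

// Spin until the fence reports Signaled, without ever yielding or sleeping.
// `query_status` is assumed to wrap vkGetFenceStatus(device, fence) in real
// code. Returns the number of status queries made, for illustration only.
template <typename StatusFn>
std::uint64_t busy_wait(StatusFn&& query_status) {
  std::uint64_t polls = 0;
  for (;;) {
    ++polls;
    switch (query_status()) {
      case FenceStatus::Signaled:
        return polls;
      case FenceStatus::DeviceLost:
        throw std::runtime_error("device lost while polling fence");
      case FenceStatus::NotReady:
        break;  // Leave the switch and poll again immediately.
    }
  }
}

// Hypothetical fake fence for experimenting without a GPU: it reports
// NotReady `remaining` times and Signaled on every query after that.
struct FakeFence {
  int remaining;
  FenceStatus operator()() {
    return remaining-- > 0 ? FenceStatus::NotReady : FenceStatus::Signaled;
  }
};
```

With the fake fence, `assert(busy_wait(FakeFence{3}) == 4);` holds: three pending queries followed by the one that observes the signal. The trade-off is that the calling thread occupies a core for the whole wait, which is the intended effect here but costs power.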
Differential Revision: [D37800456](https://our.internmc.facebook.com/intern/diff/D37800456/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81470
Approved by: https://github.com/kimishpatel