vulkan: change graph_compute to be async and enable get_tensor_async (#17158)
* vulkan: change graph_compute to be async and enable get_tensor_async
This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.
Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.
The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.
The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.
* fix thread safety errors
* teardown context cleanly
* Handle async read to non-pinned dst