llama.cpp
38eaf32a - vulkan: change graph_compute to be async and enable get_tensor_async (#17158)

Commit

122 days ago

vulkan: change graph_compute to be async and enable get_tensor_async (#17158)

* vulkan: change graph_compute to be async and enable get_tensor_async

  This allows some additional CPU/GPU overlap for large prompt-processing (pp) workloads. It also seems to help a bit for token generation, possibly by removing a small bubble between graph_compute and get_tensor.

  The async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them.

  The async commands need to be ordered against each other, so they are all put on the compute queue. The non-async commands still use the transfer queue. The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize.

* fix thread safety errors
* teardown context cleanly
* Handle async read to non-pinned dst
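The ordering and fence scheme in the message can be modeled in a few lines. This is a minimal Python sketch under stated assumptions, not the ggml Vulkan code. `ComputeQueue` and `AsyncBackend` are hypothetical names, a worker thread stands in for the GPU compute queue, and a `threading.Event` stands in for the fence. `synchronize()` only loosely mirrors the role the message gives `ggml_vk_synchronize`. The staging-then-copy path for non-pinned destinations is an assumption about how such a read can be handled, not a description of the actual implementation.

```python
# Hypothetical model of the commit's scheme; NOT ggml's Vulkan code.
# A worker thread plays the GPU compute queue; threading.Event plays a fence.
import queue
import threading


class ComputeQueue:
    """In-order queue: batches run one at a time, in submission order."""

    def __init__(self):
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._pending.get()
            if item is None:  # shutdown sentinel
                return
            commands, fence = item
            for cmd in commands:
                cmd()
            if fence is not None:
                fence.set()  # like a fence signaling after its batch

    def submit(self, commands, fence=None):
        # Returns immediately; the caller (the "CPU") keeps running.
        self._pending.put((list(commands), fence))

    def shutdown(self):
        self._pending.put(None)
        self._worker.join()


class AsyncBackend:
    """Hypothetical backend: async work shares one queue; synchronize() waits."""

    def __init__(self):
        self.queue = ComputeQueue()
        self.device_tensors = {}    # only touched from the worker thread
        self._deferred_copies = []  # (dst, staging) pairs for non-pinned reads

    def set_tensor(self, name, values):
        # Synchronous upload: submit and wait. (The commit keeps non-async
        # commands on a separate transfer queue; this sketch has one queue.)
        data, done = list(values), threading.Event()
        def upload():
            self.device_tensors[name] = data
        self.queue.submit([upload], done)
        done.wait()

    def graph_compute(self, name, fn, *inputs):
        # Async: submitted without a fence, so this returns before it runs.
        def run():
            args = [self.device_tensors[i] for i in inputs]
            self.device_tensors[name] = fn(*args)
        self.queue.submit([run])

    def get_tensor_async(self, name, dst, pinned):
        # Same queue as graph_compute, so the read is ordered after compute.
        if pinned:
            def copy():
                dst[:] = self.device_tensors[name]  # write straight into dst
            self.queue.submit([copy])
        else:
            # Assumption: stage on the queue, copy to dst after the fence wait.
            staging = []
            def copy_to_staging():
                staging[:] = self.device_tensors[name]
            self.queue.submit([copy_to_staging])
            self._deferred_copies.append((dst, staging))

    def synchronize(self):
        # Submit the fence after all queued work and block until it signals.
        fence = threading.Event()
        self.queue.submit([], fence)
        fence.wait()
        for dst, staging in self._deferred_copies:
            dst[:] = staging
        self._deferred_copies.clear()

    def close(self):
        # Tear down cleanly: drain outstanding work, then stop the worker.
        self.synchronize()
        self.queue.shutdown()


backend = AsyncBackend()
backend.set_tensor("x", [1, 2, 3])
backend.graph_compute("y", lambda x: [v * 2 for v in x], "x")
backend.graph_compute("z", lambda y: [v + 1 for v in y], "y")
out_pinned, out_pageable = [], []
backend.get_tensor_async("z", out_pinned, pinned=True)
backend.get_tensor_async("z", out_pageable, pinned=False)
# The CPU is free to do other work here while the queue drains.
backend.synchronize()
print(out_pinned, out_pageable)  # [3, 5, 7] [3, 5, 7]
backend.close()
```

Because every async command in this model goes through one in-order queue, the fence submitted last in `synchronize()` cannot signal until the compute and read-back work queued before it has finished. That is the ordering property the message relies on when it puts all async commands on the compute queue.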

References

#17158 - vulkan: change graph_compute to be async and enable get_tensor_async

Author

jeffbolznv

Parents

9b17d74a

