[vulkan] Release GPU resources when vTensor::View is destroyed (#66477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66477
Currently, Vulkan tensor memory is allocated and deallocated through the following mechanism:
1. During inference, ops will request buffer and/or texture memory for tensors from the [Resource Pool](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/Resource.h#L324-L327)
2. The resource pool allocates the memory and [adds it to a vector](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/Resource.cpp#L609-L622) containing all the memory allocations it has made this inference, then returns the most recently allocated block of memory
3. At the end of inference, results are transferred back to the CPU and the [context is flushed](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/ops/Copy.cpp#L150)
4. As part of the context flush, the [resource pool is purged](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/Context.cpp#L143), which [deallocates all buffer and texture memory](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/Resource.cpp#L683-L684) allocated by the resource pool
This pattern makes it impossible to support models with multiple outputs: when the first output tensor is transferred back to the CPU, the context flush deallocates the memory backing the remaining output tensors.
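Schematically, the current behavior looks like the following C++ sketch. The `Pool` and `Memory` names are illustrative stand-ins for the actual classes in `Resource.h`, not their real interfaces:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative stand-in for a block of GPU buffer/texture memory.
struct Memory {
  std::size_t size;
};

class Pool {
 public:
  // Step 2: allocate a block and retain ownership in the pool, so every
  // allocation made during an inference lives until the next purge().
  Memory* allocate(std::size_t size) {
    allocations_.push_back(std::make_unique<Memory>(Memory{size}));
    return allocations_.back().get();
  }

  // Step 4: purging frees *every* allocation made this inference,
  // including memory still backing live output tensors.
  void purge() {
    allocations_.clear();
  }

 private:
  std::vector<std::unique_ptr<Memory>> allocations_;
};
```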
An alternative is to tie resource destruction to the destructor of the [vTensor::View](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/ops/Tensor.h#L243) class, which holds the actual implementation and storage of Vulkan tensors. This ensures that the memory associated with a tensor is cleaned up once the tensor is no longer referenced.
The proposed deallocation mechanism (see the sketch after this list) is:
1. During inference, `vTensor` objects will request GPU memory from the resource pool, same as before.
2. The resource pool allocates buffer or texture memory and returns ownership directly to the `vTensor`.
3. Throughout inference, intermediate tensors' reference counts will drop to 0, invoking the destructor of the `View` class.
4. The destructor will add any texture and buffer memory it is holding to the resource pool's list of GPU memory allocations to be cleaned up.
5. At the end of inference, `purge()` will be called, destroying all allocations on the cleanup list.
6. GPU memory for output tensors will not be destroyed, since their reference counts remain greater than 0 and their memory has therefore never been added to the cleanup list.
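A minimal sketch of this scheme, again using illustrative stand-ins (`register_for_cleanup()` is an assumed name, not the actual API in `Resource.h`):

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Illustrative stand-in for a block of GPU buffer/texture memory.
struct Memory {
  std::size_t size;
};

class Pool {
 public:
  // Step 2: ownership of the allocation goes straight to the caller.
  std::unique_ptr<Memory> allocate(std::size_t size) {
    return std::make_unique<Memory>(Memory{size});
  }

  // Step 4: the View destructor queues memory for destruction rather than
  // freeing it immediately, since submitted GPU work may still be using it.
  void register_for_cleanup(std::unique_ptr<Memory> memory) {
    to_clean_.push_back(std::move(memory));
  }

  // Step 5: called at the end of inference; only queued allocations are
  // destroyed, so memory held by live output tensors survives (step 6).
  void purge() {
    to_clean_.clear();
  }

 private:
  std::vector<std::unique_ptr<Memory>> to_clean_;
};

class View {
 public:
  View(Pool* pool, std::size_t size)
      : pool_(pool), memory_(pool->allocate(size)) {}

  // Step 3: when the last reference to a tensor drops, its memory is
  // handed to the pool's cleanup list instead of being freed here.
  ~View() {
    pool_->register_for_cleanup(std::move(memory_));
  }

 private:
  Pool* pool_;
  std::unique_ptr<Memory> memory_;
};
```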
Note that it is not correct to have the destructor deallocate GPU memory directly: Vulkan ops merely submit work to the GPU and do not guarantee that the work has completed when the op returns. Therefore, all allocated GPU memory must be kept alive until the end of inference, when we wait for the GPU to complete its work.
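Concretely, the end-of-inference sequence looks roughly like this, building on the `Pool` sketch above (`wait_for_gpu()` is an illustrative stub for the fence wait, not the actual API):

```cpp
// Stub standing in for a fence wait (e.g. vkWaitForFences /
// vkQueueWaitIdle in the real backend); illustrative only.
void wait_for_gpu() {}

void end_of_inference(Pool& pool) {
  wait_for_gpu(); // all submitted command buffers have now completed
  pool.purge();   // only now is it safe to destroy memory queued by ~View
}
```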
Test Plan:
Build and run `vulkan_api_test` to make sure existing functionality is not impacted.
A later diff adds a test that checks output tensors stay alive after inference completes.
Reviewed By: dreiss
Differential Revision: D31510899
fbshipit-source-id: 99250c2800a68f07b1b91dbf5d3b293184da5bd2