[vulkan] VulkanTensor lazy buffer allocation (#42569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42569
We do not need to allocate buffers for Vulkan tensors unless they are the forward input or output.
This change removes the default `allocate_storage()` call for operation outputs; their image (texture) representation holds the result.
A buffer is now allocated only when an operation actually requires one (e.g. concatenate, transpose) or when copying to the host.
If no buffer has been allocated, `VulkanTensor.image()` allocates just the texture and skips the buffer-to-texture copy.
Since `allocate_storage()` was previously called for every operation, this saves a buffer allocation and a `buffer_to_image` call per op.
MobileNetV2 on my Pixel 4:
```
flame:/data/local/tmp $ ./speed_benchmark_torch --model=mnfp32-vopt.pt --input_type=float --input_dims=1,3,224,224 --warmup=3 --iter=20 --vulkan=true
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 305818. Iters per second: 3.26991
Segmentation fault
```
```
139|flame:/data/local/tmp $ ./speed_benchmark_torch_noas --model=mnfp32-vopt.pt --input_type=float --input_dims=1,3,224,224 --warmup=3 --iter=20 --vulkan=true
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 236768. Iters per second: 4.22355
Segmentation fault
```
Test Plan: Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D22946552
Pulled By: IvanKobzarev
fbshipit-source-id: ac0743bb316847632a22cf9aafb8938e50b2fb7b