[iOS GPU][Design] Support multiple tensors as outputs (#56072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56072
Currently, we don't support outputting more than one tensor on GPU. For example, if you do
```
auto x = at::rand({1, 4, 2, 2}).metal();
auto y = at::chunk(x, 2, 1); // y is a vector of two tensors
auto output1 = y[0].cpu();
auto output2 = y[1].cpu();
```
In the example above, when execution reaches `y[0].cpu()`, the command buffer is committed in order to move `y[0]` from GPU to CPU. By the time `y[1].cpu()` runs, the command buffer has already become invalid and the temporary image backing `y[1]` has been recycled, so a runtime exception is thrown.
We address this with the observer pattern:
1. Before we flush the command buffer, we notify its observers (the MPSImageWrapper objects that hold the temporary images).
2. When the observers receive the notification, they turn their temporary images into static images.
3. Now, when `.cpu()` is called, the output tensor can read the data directly from the static image generated in step 2.
You may be wondering whether this has a hidden cost: do all the intermediate tensors hold on to unused static images? The answer is no. All intermediate tensors are released once their reference counts drop to zero. Since MetalTensorImpl subclasses TensorImpl, we override the `release_resources` method, which gives us a chance to release the underlying storage (textures and buffers) and remove the observers from the command buffer. Therefore, once the intermediate tensors go away, their temporary images are recycled immediately.
ghstack-source-id: 127079751
Test Plan:
- We'll be using `at::chunk` to test this in the following diffs, as it returns a vector containing multiple tensors.
- Sandcastle CI
- CircleCI
Reviewed By: dreiss
Differential Revision: D27165886
fbshipit-source-id: 290b0d77b1dc74990b25cbd0abb775df1ab47ca0