[Vulkan] Implement slice operator (#69382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69382
Implemented the `slice` operator on the Vulkan backend:
* Supports only tensors with at most 4 dimensions.
* `aten::slice.Tensor` is the op that runs internally when a tensor is indexed with a slice.
* Slicing selects a range of a tensor's elements with the `:` operator, using element indices to mark the range.
* Indexing starts at 0 and `end` is exclusive. The example below selects the elements from the start of the tensor up to, but not including, index 4.
```
import torch

tensor = torch.tensor([2, 4, 1, 7, 0, 9])
print(tensor[:4])
# Output: tensor([2, 4, 1, 7])
```
* Input tensors are generalized to 4D to simplify input/output texture handling. For example, a tensor of size {2, 3} is treated as {1, 1, 2, 3} internally.
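A minimal sketch of the 4D generalization described above, in plain Python. The helper name `to_4d` is hypothetical and only illustrates the left-padding with 1s; it is not the backend's actual code.

```python
def to_4d(sizes):
    """Left-pad a shape with 1s so it always has four dims (N, C, H, W)."""
    sizes = list(sizes)
    if len(sizes) > 4:
        raise ValueError("only tensors with at most 4 dims are supported")
    return [1] * (4 - len(sizes)) + sizes


print(to_4d([2, 3]))  # [1, 1, 2, 3]
```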
* Negative `start` and `end` inputs are allowed.
* CPU implementation: [/aten/src/ATen/native/TensorShape.cpp::slice()](https://github.com/pytorch/pytorch/blob/3e45739543fbce471fc4ed26ff079efe170de0f1/aten/src/ATen/native/TensorShape.cpp#L1262)
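The handling of negative and out-of-range `start`/`end` can be sketched as below. This assumes Python-style clamping semantics, which the linked CPU implementation is understood to follow; `normalize_slice` is a hypothetical helper name, not a function from the PR.

```python
def normalize_slice(size, start=None, end=None, step=1):
    """Return (start, end, length) with start/end clamped to [0, size]."""
    if step <= 0:
        raise ValueError("slice step must be positive")
    start = 0 if start is None else start
    end = size if end is None else end
    # Negative indices count from the end of the dimension.
    if start < 0:
        start += size
    if end < 0:
        end += size
    # Clamp so that 0 <= start <= end <= size.
    start = min(max(start, 0), size)
    end = min(max(end, start), size)
    # Number of selected elements, rounding up for a partial last step.
    length = (end - start + step - 1) // step
    return start, end, length


print(normalize_slice(50, -30, -10, 2))  # (20, 40, 10)
```

With this normalization, the later copy and shader steps only ever see non-negative, in-range bounds.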
* For the **width** dimension, use the `vkCmdCopyImage` API:
  * input texture size = `{x,y,z}`
  * if `step` is 1, copy a region from the input texture to the output texture once, where
    * source offset = `{start,0,0}`
    * destination offset = `{0,0,0}`
    * copy extents = `{end-start,y,z}`
    * call the `vkCmdCopyImage` API
  * if `step` is not 1, loop from x=`start` to `end-1` by `step` (also from x_new=`0` to `end-start-1`), where
    * x_max = width of the input texture (the `x` of its size)
    * copy extents = `{1,y,z}`
    * if (x >= x_max) continue; // out of range
    * source offset = `{x,0,0}`
    * destination offset = `{x_new,0,0}`
    * call the `vkCmdCopyImage` API
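The width case above can be sketched as a function that only lists the copy regions, without touching Vulkan. `width_copy_regions` is a hypothetical name, and `start`/`end` are assumed to be already normalized to non-negative values.

```python
def width_copy_regions(extent, start, end, step):
    """List (src_offset, dst_offset, copy_extent) tuples for a width slice.

    `extent` is the input texture size (x, y, z).
    """
    x_max, y, z = extent
    if step == 1:
        # One copy covers the whole contiguous region.
        return [((start, 0, 0), (0, 0, 0), (end - start, y, z))]
    regions = []
    # One single-column copy per selected x; x_new is the output column.
    for x_new, x in enumerate(range(start, end, step)):
        if x >= x_max:  # out of range
            continue
        regions.append(((x, 0, 0), (x_new, 0, 0), (1, y, z)))
    return regions


for region in width_copy_regions((50, 40, 2), 10, 30, 7):
    print(region)
```

Each tuple corresponds to one `vkCmdCopyImage` call, so a strided slice issues as many copies as there are selected columns.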
* For the **height** dimension, use the `vkCmdCopyImage` API:
  * input texture size = `{x,y,z}`
  * if `step` is 1, copy a region from the input texture to the output texture once, where
    * source offset = `{0,start,0}`
    * destination offset = `{0,0,0}`
    * copy extents = `{x,end-start,z}`
    * call the `vkCmdCopyImage` API
  * if `step` is not 1, loop from y=`start` to `end-1` by `step` (also from y_new=`0` to `end-start-1`), where
    * y_max = height of the input texture (the `y` of its size)
    * copy extents = `{x,1,z}`
    * if (y >= y_max) continue; // out of range
    * source offset = `{0,y,0}`
    * destination offset = `{0,y_new,0}`
    * call the `vkCmdCopyImage` API
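To check that per-row copies reproduce a strided slice, the height case can be simulated on the CPU. In this sketch a texture is a nested list indexed `[z][y][x]`, and `copy_image` is a hypothetical stand-in for a single `vkCmdCopyImage` call; neither is part of the PR.

```python
def copy_image(src, dst, src_off, dst_off, extent):
    """CPU stand-in for one image copy; textures are indexed [z][y][x]."""
    (sx, sy, sz), (dx, dy, dz), (ex, ey, ez) = src_off, dst_off, extent
    for k in range(ez):
        for j in range(ey):
            for i in range(ex):
                dst[dz + k][dy + j][dx + i] = src[sz + k][sy + j][sx + i]


def slice_height(src, start, end, step):
    """Slice rows start:end:step of a texture, issuing one copy per row."""
    z, y_max, x = len(src), len(src[0]), len(src[0][0])
    rows = [y for y in range(start, end, step) if y < y_max]
    dst = [[[None] * x for _ in rows] for _ in range(z)]
    for y_new, y in enumerate(rows):
        copy_image(src, dst, (0, y, 0), (0, y_new, 0), (x, 1, z))
    return dst


# A 2 x 6 x 3 (z, y, x) texture whose values encode their own position.
src = [[[100 * k + 10 * j + i for i in range(3)] for j in range(6)]
       for k in range(2)]
out = slice_height(src, 1, 6, 2)
print(out == [plane[1:6:2] for plane in src])
```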
* For the **batch** and **feature** (channel) dimensions, the shader is built from the output texture's point of view: each invocation computes one output texel, which avoids depending on the nondeterministic order of GPU shader operations between texels. See [incoherent memory access](https://www.khronos.org/opengl/wiki/Memory_Model#Incoherent_memory_access).
  * `b,c,h,w` = indices into the input tensor (NCHW)
  * `b1,c1,h1,w1` = indices into the output tensor (NCHW)
  * `posIn` = position (x,y,z) in the input texture
  * `posOut` = position (x,y,z) in the output texture
  * `inval` = input texel value
  * `outval` = output texel value
  * `max_dst_index` = batch size of output tensor * channel size of output tensor
  * `n` = end - start
  * `i` = component index within the input texel (0..3) and `j` = component index within the output texel (0..3)
  * Pseudo code:
```
for (uint j = 0; j < 4; ++j) {
  dst_index = posOut.z * 4 + j;
  if (dst_index >= max_dst_index) {
    save outval to output texture at posOut
    break; // out of range
  }
  b1 = int(dst_index / channel size of output tensor);
  c1 = dst_index % channel size of output tensor;
  h1 = posOut.y;
  w1 = posOut.x;
  b = b1;
  c = c1;
  h = h1;
  w = w1;
  if (dim == 0) { // batch
    b = start + step * b1;
  } else { // feature (channel)
    c = start + step * c1;
  }
  src_index = b * channel size of input tensor + c;
  posIn.x = int(w);
  posIn.y = int(h);
  posIn.z = int(src_index / 4);
  i = (src_index % 4);
  read inval from input texture at posIn
  outval[j] = inval[i];
  if (j == 3) {
    save outval to output texture at posOut
  }
}
```
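The pseudo code above can be exercised on the CPU with a small Python port. This is a sketch under one assumption taken from the pseudo code itself: the flat index `b * C + c` is packed four values per texel along `z`. The names `pack` and `slice_shader` are hypothetical, and the real backend's texture layout may differ.

```python
def pack(t):
    """Pack an NCHW nested list into texels: tex[z][y][x] holds 4 values."""
    B, C, H, W = len(t), len(t[0]), len(t[0][0]), len(t[0][0][0])
    depth = (B * C + 3) // 4
    tex = [[[[0] * 4 for _ in range(W)] for _ in range(H)]
           for _ in range(depth)]
    for b in range(B):
        for c in range(C):
            idx = b * C + c
            for h in range(H):
                for w in range(W):
                    tex[idx // 4][h][w][idx % 4] = t[b][c][h][w]
    return tex


def slice_shader(tex_in, in_c, out_b, out_c, dim, start, step):
    """Run the per-output-texel logic over every output position."""
    H, W = len(tex_in[0]), len(tex_in[0][0])
    max_dst_index = out_b * out_c
    depth = (max_dst_index + 3) // 4
    tex_out = [[[[0] * 4 for _ in range(W)] for _ in range(H)]
               for _ in range(depth)]
    for z in range(depth):
        for y in range(H):
            for x in range(W):
                outval = [0] * 4
                for j in range(4):
                    dst_index = z * 4 + j
                    if dst_index >= max_dst_index:
                        break  # out of range
                    b, c = divmod(dst_index, out_c)
                    if dim == 0:  # batch
                        b = start + step * b
                    else:  # feature (channel)
                        c = start + step * c
                    src_index = b * in_c + c
                    outval[j] = tex_in[src_index // 4][y][x][src_index % 4]
                tex_out[z][y][x] = outval
    return tex_out


# A {3, 5, 2, 2} tensor whose values encode their own (b, c, h, w) position.
t = [[[[1000 * b + 100 * c + 10 * h + w for w in range(2)]
       for h in range(2)] for c in range(5)] for b in range(3)]
# Channel slice 1:5:2 keeps channels 1 and 3.
expected = [[img[c] for c in range(1, 5, 2)] for img in t]
out = slice_shader(pack(t), in_c=5, out_b=3, out_c=2, dim=1, start=1, step=2)
print(out == pack(expected))
```

Because every output texel is computed only from reads of the input texture, no two iterations write to the same location, which is the property the output-centric design relies on.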
* Error/edge cases:
  * The Vulkan backend doesn't support zero-sized slices. It throws an exception when allocating a Vulkan buffer if any dim size is zero.
  * The slice step must be positive.
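These two restrictions can be sketched as an up-front check. Note the difference from the description above: the backend is said to throw while allocating the buffer, whereas this hypothetical `check_slice_args` helper rejects the inputs early. The exception type and messages are assumptions for illustration.

```python
def check_slice_args(sizes, dim, start, end, step):
    """Raise ValueError for unsupported cases; return the output sizes."""
    if step <= 0:
        raise ValueError("slice step must be positive")
    if any(s == 0 for s in sizes):
        raise ValueError("zero-sized tensors are not supported")
    out = list(sizes)
    # slice.indices() resolves negative and out-of-range bounds.
    out[dim] = len(range(*slice(start, end, step).indices(sizes[dim])))
    if out[dim] == 0:
        raise ValueError("zero-sized slice is not supported")
    return out


print(check_slice_args([2, 3, 40, 50], 3, 10, 30, 7))  # [2, 3, 40, 3]
```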
* Generalized test cases with tensors of different dim sizes for batch, feature, height and width. For example, slicing a 4D tensor along dim=width:
```
tensor {2, 3, 40, 50} slicing with dim=3, start=10, end=30, step=1 <-> tensor indexing by [:,:,:,10:30:1]
tensor {2, 3, 40, 50} slicing with dim=3, start=10, end=30, step=7 <-> tensor indexing by [:,:,:,10:30:7]
tensor {2, 3, 40, 50} slicing with dim=3, start=10, end=50, step=2 <-> tensor indexing by [:,:,:,10:50:2] with end=out of range
tensor {2, 3, 40, 50} slicing with dim=3, start=-60, end=60, step=2 <-> tensor indexing by [:,:,:,-60:60:2] with start/end=out of range
tensor {2, 3, 40, 50} slicing with dim=3, start=-30, end=-10, step=2 <-> tensor indexing by [:,:,:,-30:-10:1] with negative start/end
tensor {2, 3, 40, 50} slicing with dim=3, start=0, end=INT64_MAX, step=2 <-> tensor indexing by [:,:,:,0:9223372036854775807:1] with end=INT64_MAX
tensor {2, 3, 40, 50} slicing with dim=3, start=-10, end=INT64_MAX, step=2 <-> tensor indexing by [:,:,:,-10:9223372036854775807:1] with negative start and end=INT64_MAX
tensor {2, 3, 40, 50} slicing with dim=3, start=INT64_MIN, end=INT64_MAX, step=2 <-> tensor indexing by [:,:,:,-9223372036854775808:9223372036854775807:1] with start=INT64_MIN and end=INT64_MAX
tensor {2, 3, 40, 50} slicing with dim=3, start=empty, end=empty, step=2 <-> tensor indexing by [:,:,:,::1] with empty start/end
```
* References:
  * [Slicing PyTorch Datasets](https://lewtun.github.io/blog/til/nlp/pytorch/2021/01/24/til-slicing-torch-datasets.html)
  * [How to Slice a 3D Tensor in Pytorch?](https://www.geeksforgeeks.org/how-to-slice-a-3d-tensor-in-pytorch/)
  * [PyTorch Tensor Indexing API](https://pytorch.org/cppdocs/notes/tensor_indexing.html#translating-between-python-c-index-types)
  * [PyTorch Tensor Indexing](https://deeplearninguniversity.com/pytorch/pytorch-tensor-indexing/)
  * [Slicing and Striding](https://mlverse.github.io/torch/articles/indexing.html#slicing-and-striding)
* Vulkan `slice` operator tensor conversion:
{F684363708}
Test Plan:
Build & test on Android:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```
Build & test on macOS:
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_api_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAppleMac\#macosx-x86_64
```
Test result on Android (Google Pixel 5):
```
[ RUN ] VulkanAPITest.slice_width_success
[ OK ] VulkanAPITest.slice_width_success (17 ms)
[ RUN ] VulkanAPITest.slice_height_success
[ OK ] VulkanAPITest.slice_height_success (13 ms)
[ RUN ] VulkanAPITest.slice_feature_success
[ OK ] VulkanAPITest.slice_feature_success (20 ms)
[ RUN ] VulkanAPITest.slice_batch_success
[ OK ] VulkanAPITest.slice_batch_success (9 ms)
[ RUN ] VulkanAPITest.slice_invalidinputs_exceptions
[ OK ] VulkanAPITest.slice_invalidinputs_exceptions (0 ms)
```
Test result on macOS:
```
[ RUN ] VulkanAPITest.slice_width_success
[ OK ] VulkanAPITest.slice_width_success (81 ms)
[ RUN ] VulkanAPITest.slice_height_success
[ OK ] VulkanAPITest.slice_height_success (56 ms)
[ RUN ] VulkanAPITest.slice_feature_success
[ OK ] VulkanAPITest.slice_feature_success (132 ms)
[ RUN ] VulkanAPITest.slice_batch_success
[ OK ] VulkanAPITest.slice_batch_success (33 ms)
[ RUN ] VulkanAPITest.slice_invalidinputs_exceptions
[ OK ] VulkanAPITest.slice_invalidinputs_exceptions (1 ms)
```
Reviewed By: SS-JIA
Differential Revision: D32482638
fbshipit-source-id: 65841fb2d3489ee407f2b4f38619b700787d41b0