Made max_pool2d_with_indices_backward_cuda contiguify `indices` (#85493)
Currently, max_pool2d_with_indices_backward(grad_output, self, ..., indices)
(on CUDA) assumes that `indices` has the same suggested memory format as
`self`.
This is indeed always true in regular PyTorch: the max_pool2d_with_indices
forward pass returns `indices` with the same suggested memory format as
`self`.
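A minimal sketch of that invariant (assuming a CUDA build; the shapes here
are arbitrary):

```python
import torch
import torch.nn.functional as F

# Sketch of the invariant described above: the forward pass returns
# `indices` in the same suggested memory format as the input.
x = torch.randn(2, 3, 8, 8, device="cuda").to(memory_format=torch.channels_last)
out, indices = F.max_pool2d(x, kernel_size=2, return_indices=True)
assert indices.is_contiguous(memory_format=torch.channels_last)
```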
However, we'd like to argue that always contiguifying `indices` is good
for consistency, has negligible added cost, and is more robust (for
Tensor Subclass authors):
- The CPU implementation of max_pool2d_with_indices_backward already
contiguifies `indices`, as does the max_pool3d_with_indices_backward
implementation.
- Calling .contiguous() adds almost no cost over the previous behavior,
because a fast path checks the cached memory_format on the TensorImpl
and skips the copy when `indices` is already contiguous (see the sketch
after this list).
- functorch has trouble writing a batching rule for
`max_pool2d_with_indices_backward`. Having it accept `indices` with
arbitrary strides means that vmap does not need to special-case the
strides of `indices` in the batching rule.
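A quick illustration of the fast path mentioned above (a sketch of the
observable behavior, not the ATen internals):

```python
import torch

# When a tensor is already contiguous in the requested memory format,
# .contiguous() returns the tensor itself without copying, so
# unconditionally contiguifying `indices` is cheap in the common case.
indices = torch.randint(0, 16, (2, 3, 4, 4))
assert indices.contiguous() is indices  # fast path: no copy is made
```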
Test Plan:
- Not sure if it's worth writing a separate test case; this PR fixes one
of functorch's test cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85493
Approved by: https://github.com/ezyang