Fix adaptive_max_pool2d for channels-last on CUDA (#67697)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67239
The CUDA kernels for `adaptive_max_pool2d` (forward and backward) were written assuming contiguous outputs, so they produced wrong results when an output was non-contiguous, for example channels-last. If an output is non-contiguous, first create a contiguous copy of it and let the kernel write its results into that contiguous memory. Then copy the results from the contiguous memory back into the original non-contiguous output.
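The copy-out pattern described above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the actual ATen/CUDA code: `contiguous_only_kernel`, `run_kernel`, and `read` are hypothetical names, flat Python lists stand in for tensor storage, and a freshly allocated buffer stands in for the contiguous copy (equivalent here because the kernel overwrites every element).

```python
from itertools import product
from math import prod


def contiguous_strides(shape):
    """Row-major (contiguous) strides, in elements, for `shape`."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def indices(shape):
    """All logical indices of `shape`, in row-major order."""
    return product(*(range(d) for d in shape))


def contiguous_only_kernel(out, shape, value_at):
    """Hypothetical stand-in for a kernel that assumes contiguous output:
    the i-th logical element (row-major) is always written to out[i]."""
    for i, idx in enumerate(indices(shape)):
        out[i] = value_at(idx)


def run_kernel(out, shape, strides, value_at):
    """Run the kernel directly on contiguous output; otherwise write into a
    contiguous temporary and copy back through the real strides."""
    if tuple(strides) == contiguous_strides(shape):
        contiguous_only_kernel(out, shape, value_at)
        return
    tmp = [0] * prod(shape)  # contiguous scratch space (assumption: fresh buffer)
    contiguous_only_kernel(tmp, shape, value_at)
    for i, idx in enumerate(indices(shape)):
        out[sum(k * s for k, s in zip(idx, strides))] = tmp[i]


def read(storage, strides, idx):
    """Read the logical element `idx` from strided storage."""
    return storage[sum(k * s for k, s in zip(idx, strides))]


# Shape is (N, C, H, W) = (1, 2, 1, 2); channels-last strides are (H*W*C, 1, W*C, C).
shape = (1, 2, 1, 2)
channels_last = (4, 1, 4, 2)
value_at = lambda idx: 10 * idx[1] + idx[3]  # encodes (channel, column)

naive = [0] * 4
contiguous_only_kernel(naive, shape, value_at)     # ignores the strides (the bug)
fixed = [0] * 4
run_kernel(fixed, shape, channels_last, value_at)  # copy-out pattern (the fix)

print(read(naive, channels_last, (0, 1, 0, 0)))  # wrong element for channel 1
print(read(fixed, channels_last, (0, 1, 0, 0)))  # matches value_at((0, 1, 0, 0))
```

The extra temporary and copy cost memory and bandwidth only on the non-contiguous path; contiguous outputs still go straight to the kernel.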
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67697
Reviewed By: ejguan
Differential Revision: D32112443
Pulled By: ngimel
fbshipit-source-id: 0e3bf06d042200c651a79d13b75484526fde11fe