Fix max_pool2d NHWC for large tensors; fix incorrect use of cudaGetLastError() (#34519)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/33988 and https://github.com/pytorch/pytorch/issues/34083.
Previously, the max_pool2d_nhwc kernels used shared memory whose size was proportional to the tensor size (c \* h \* w). When the tensor was too large, the requested shared memory exceeded the per-block limit and the kernel launch failed.
This PR follows the approach used in AdaptiveAvgPool2d_nhwc: the "C" dimension is split across additional blocks along grid.x. With that change, the per-block shared-memory size is bounded (below the 48 KB limit) regardless of tensor size.
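The sizing logic behind the split can be sketched as follows. This is an illustrative Python model, not the actual CUDA kernel code; the function name, the per-channel byte cost, and the 48 KB budget are assumptions for the example, not values taken from the PR.

```python
# Hypothetical sketch: choose how many channels each block handles so
# that per-block shared memory stays under a fixed budget, and split the
# remaining channels across extra blocks along grid.x.

MAX_SHARED_BYTES = 48 * 1024  # common per-block shared-memory limit


def split_channels(c, bytes_per_channel):
    """Return (channels_per_block, num_blocks_x).

    channels_per_block * bytes_per_channel never exceeds
    MAX_SHARED_BYTES, no matter how large c is.
    """
    max_c_per_block = max(1, MAX_SHARED_BYTES // bytes_per_channel)
    channels_per_block = min(c, max_c_per_block)
    num_blocks_x = -(-c // channels_per_block)  # ceiling division
    return channels_per_block, num_blocks_x
```

For a very large channel count, the shared-memory request per block stays bounded while grid.x grows, which is the property that prevents the launch failure described above.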
A benchmark is available [here](https://github.com/xwang233/code-snippet/blob/0b98146089ffca65d3d56968a9eafbb401a82493/max-pool2d/max-pool2d.ipynb). TL;DR: barely any performance drop was observed.
cc csarofeen ptrblck jjsjann123 VitalyFedyunin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34519
Differential Revision: D20388848
Pulled By: VitalyFedyunin
fbshipit-source-id: 9454f385f9315afaab4a05303305578bbcd80b87