[CUDA] max_pool2d NCHW performance improvement (#42182)
Summary:
Fix the performance regression introduced in https://github.com/pytorch/pytorch/issues/38953.
Please see https://github.com/xwang233/code-snippet/blob/master/max-pool2d-nchw-perf/max-pool2d.ipynb for detailed before & after performance comparisons.
Change in backward max_pool2d performance with this PR compared to before it (a negative value means a speedup):
![image](https://user-images.githubusercontent.com/24860335/88712204-363c8e00-d0ce-11ea-8586-057e09b16103.png)
The modulo computation in the forward kernel does not seem to benefit much from a similar change, so the forward pass is left unchanged. See https://github.com/pytorch/pytorch/pull/42182/commits/1718f0ccfd84ed33229b69826e8e1c53dc5725f7
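For readers unfamiliar with the term, here is a minimal sketch of the per-element index arithmetic that "modulo" presumably refers to. It assumes one thread handles one flat offset into a contiguous NCHW tensor. The function names are hypothetical, and this is plain Python for illustration, not the kernel code from this PR.

```python
def unflatten_nchw(index, channels, height, width):
    """Split a flat offset into (n, c, h, w) for a contiguous NCHW tensor."""
    # Each coordinate costs integer divisions and a modulo; on a GPU this
    # arithmetic is repeated for every element a thread processes.
    w = index % width
    h = (index // width) % height
    c = (index // (width * height)) % channels
    n = index // (width * height * channels)
    return n, c, h, w


def flatten_nchw(n, c, h, w, channels, height, width):
    """Inverse mapping: (n, c, h, w) back to the flat offset."""
    return ((n * channels + c) * height + h) * width + w
```

If n and c can be taken from some other source (for example the launch grid, which is an assumption here), only the h and w arithmetic remains per element.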
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42182
Reviewed By: albanD
Differential Revision: D22829498
Pulled By: ngimel
fbshipit-source-id: 4c81968fe072f4e264e70c70ade4c32d760a3af4