Fix max_pool failure when an NHWC tensor's height or width is larger than the max CUDA grid size (#28931)
Summary:
When an NHWC tensor has a height or width larger than the max CUDA grid size, max_pool fails with error code 0.
Example: https://github.com/pytorch/pytorch/issues/28714
This change limits the grid size to the maximum that CUDA allows and processes the input in chunks, so that inputs larger than the grid are still fully covered.
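
For illustration only, the clamp-and-chunk idea can be simulated on the host in Python. This is a sketch under stated assumptions, not the kernel code from this change: the names `clamped_grid` and `covered_indices` are hypothetical, the chunking is modeled as a grid-stride loop (which may differ from the actual implementation), and 65535 is assumed as the limit because it is the CUDA maximum for the y and z grid dimensions.

```python
MAX_GRID_DIM = 65535  # CUDA limit for gridDim.y and gridDim.z (assumed relevant here)


def clamped_grid(extent, block, max_grid=MAX_GRID_DIM):
    """Number of blocks to launch along one dimension, capped at max_grid."""
    blocks_needed = (extent + block - 1) // block  # ceiling division
    return min(blocks_needed, max_grid)


def covered_indices(extent, block, max_grid=MAX_GRID_DIM):
    """Simulate a grid-stride loop and count how often each index is visited."""
    grid = clamped_grid(extent, block, max_grid)
    stride = grid * block  # elements handled by one full pass of the grid
    seen = [0] * extent
    for block_idx in range(grid):
        for thread_idx in range(block):
            i = block_idx * block + thread_idx
            # Each simulated thread steps forward by the whole grid's width,
            # so indices beyond the clamped grid are handled in later chunks.
            while i < extent:
                seen[i] += 1
                i += stride
    return seen
```

With `extent=70000` and `block=1`, the launch is capped at 65535 blocks, and the stride loop still visits each of the 70000 positions exactly once.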
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28931
Differential Revision: D18358892
Pulled By: ifedan
fbshipit-source-id: 2fd65448bd644f1588a0e208edaaea5bcb6a7d52