Adjust launch_bounds annotation for AMD hardware. (#17555)
Summary:
The max pooling backwards kernel is currently annotated with launch bounds (256,8).
Adjust the number of waves to 4 (4 times 64 is 256) for ROCm. This improves training performance for torchvision models by up to 15% (AlexNet) on a gfx906 GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17555
Differential Revision: D14277744
Pulled By: bddppq
fbshipit-source-id: 2a62088f7b8a87d1e350c432bf655288967c7883