tune elementwise for AMD uarch (#16217)
Summary:
Tune the elementwise kernel for AMD architectures by increasing the work group sizes and launch bounds. This change improves training throughput for torchvision models by up to 11% in our tests, with no significant performance regressions observed.
No functional/performance change for CUDA - the existing numbers are just moved into constexpr constants.
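A minimal sketch of the idea, not the actual diff: launch parameters become compile-time constants selected per platform. The names `kThreadsPerBlock`, `kMinBlocksPerSM`, and `num_blocks`, and all numeric values, are hypothetical and chosen only for illustration; `__HIP_PLATFORM_HCC__` is assumed to be the macro that identifies the AMD build.

```cpp
#include <cstdint>

// Hypothetical names and values, for illustration only; the real
// constants live in the elementwise kernel source touched by this PR.
#ifdef __HIP_PLATFORM_HCC__
// AMD build (assumed macro): larger work group, lower occupancy hint.
constexpr int kThreadsPerBlock = 1024;
constexpr int kMinBlocksPerSM = 1;
#else
// CUDA build: numbers unchanged, only moved into named constants.
constexpr int kThreadsPerBlock = 512;
constexpr int kMinBlocksPerSM = 4;
#endif

// Number of blocks needed to cover n elements (ceiling division).
constexpr std::int64_t num_blocks(std::int64_t n,
                                  int threads = kThreadsPerBlock) {
  return (n + threads - 1) / threads;
}
```

In a device kernel, constants like these would typically feed `__launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)` and the grid size computation, so changing a platform's tuning touches one place.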
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16217
Differential Revision: D13776684
Pulled By: bddppq
fbshipit-source-id: edbaebe904598b2de66a9e9a68a1aa219ebc01e9