Generalize HIP-specific launch bounds to apply to CUDA as well (#56143)
Summary:
Launch bounds were added for HIP along the way, but smaller CUDA devices (like Jetson) also benefit from them.
This PR goes over the HIP-specific launch bounds and generalizes them to cover CUDA as well.
The long-term goal is good coverage of our kernels with launch-bound annotations, so that we eventually no longer need ad-hoc adaptations such as the block-size reduction discussed in https://github.com/pytorch/pytorch/issues/8103.
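For context, a minimal sketch of what a launch-bound annotation looks like (a hypothetical kernel, not code from this PR; the bound values are illustrative and are chosen per kernel in practice). `__launch_bounds__` tells the compiler the maximum block size a kernel will ever be launched with, letting it budget registers accordingly instead of spilling or forcing callers to shrink the block; the same attribute is honored by both nvcc and HIP:

```cuda
// Assumed illustrative caps; real kernels pick values matching
// how they are actually launched.
constexpr int kMaxThreadsPerBlock = 256;  // upper bound on blockDim.x
constexpr int kMinBlocksPerSM = 4;        // occupancy hint (optional arg)

// The annotation constrains register allocation so the kernel can
// reach the stated occupancy even on smaller devices.
__global__ void __launch_bounds__(kMaxThreadsPerBlock, kMinBlocksPerSM)
add_kernel(const float* a, const float* b, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = a[i] + b[i];
  }
}
```

Launching such a kernel with a block size above `kMaxThreadsPerBlock` is a launch error, which is why generalizing HIP-only bounds to CUDA requires checking each kernel's actual launch configuration.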
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56143
Reviewed By: agolynski
Differential Revision: D27804640
Pulled By: ngimel
fbshipit-source-id: d4c345f9f7503e050a46361bfe2625865d0a42ba