`__launch_bounds__` for `torch.mode` with CUDA 11.7 (#79710)
This is a temporary fix for `TestReductionsCUDA.test_mode_large_cuda`, which fails with CUDA 11.7 with the following error:
```
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 1805, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 1805, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 390, in instantiated_test
    raise rte
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 943, in only_fn
    return fn(slf, *args, **kwargs)
  File "test_reductions.py", line 891, in test_mode_large
    testset_for_shape((10, 2048), 10)
  File "test_reductions.py", line 883, in testset_for_shape
    self._test_mode_intervals(shape, [(i, d - i)], device)
  File "test_reductions.py", line 870, in _test_mode_intervals
    values, indices = torch.mode(x, -1, False)
RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
```
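For context, "too many resources requested for launch" typically means the kernel needs more registers per thread than the device can supply at the requested block size. The sketch below shows how `__launch_bounds__` is generally applied to cap per-thread register usage. It is an illustration only, not the code changed in this PR: the kernel `scale_kernel`, the constant `kMaxThreadsPerBlock`, and the value 1024 are all hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical upper bound on threads per block; the real kernel behind
// torch.mode may use a different value.
constexpr int kMaxThreadsPerBlock = 1024;

// __launch_bounds__ promises the compiler that this kernel is never launched
// with more than kMaxThreadsPerBlock threads per block, so the compiler keeps
// register usage per thread low enough for such a launch to succeed.
__global__ void __launch_bounds__(kMaxThreadsPerBlock)
scale_kernel(float* data, float factor, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    data[idx] *= factor;
  }
}

int main() {
  const int n = 1 << 20;
  float* d_data = nullptr;
  if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
    std::printf("cudaMalloc failed\n");
    return 1;
  }
  cudaMemset(d_data, 0, n * sizeof(float));

  // Launch with exactly the block size declared in __launch_bounds__.
  const int blocks = (n + kMaxThreadsPerBlock - 1) / kMaxThreadsPerBlock;
  scale_kernel<<<blocks, kMaxThreadsPerBlock>>>(d_data, 2.0f, n);

  // A resource failure like the one in the traceback would surface here.
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) {
    err = cudaDeviceSynchronize();
  }
  std::printf("launch status: %s\n", cudaGetErrorString(err));

  cudaFree(d_data);
  return err == cudaSuccess ? 0 : 1;
}
```

Launching a kernel with more threads per block than its declared bound fails, so the bound and the launch configuration need to stay in sync.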
cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79710
Approved by: https://github.com/malfet