[cuDNN] Work-around 32-bit indexing failures in cuDNN batchnorm (#81486)
The following workload fails on Ampere:
```
import torch
import torch.nn as nn
bn = nn.BatchNorm2d(128).cuda()
# ~2.83 billion elements, more than INT32_MAX (2**31 - 1)
x = torch.randn([256, 128, 294, 294], device='cuda')
x = bn(x)
print(x.shape)
```
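A quick back-of-the-envelope check shows why this input trips cuDNN's 32-bit indexing: its element count exceeds `INT32_MAX`, so any 32-bit index into the flattened tensor overflows.

```python
# Element count of the failing input vs. the 32-bit signed-index limit.
numel = 256 * 128 * 294 * 294
int32_max = 2**31 - 1

print(numel)              # 2832334848
print(numel > int32_max)  # True: too large for 32-bit indexing
```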
This PR adds an element-count cap to the `use_cudnn` condition so that inputs too large for cuDNN's 32-bit indexing fall back to the native batchnorm implementation, avoiding this failure.
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81486
Approved by: https://github.com/ngimel