Skip manual backward for `cdist` with case `p=2` (#31167)
Summary:
Fixes the `cdist` backward computation for large inputs in the Euclidean case (`p=2`).
The grid size used when launching the backward CUDA kernel exceeded the 2^16 limit on the second grid dimension, resulting in `RuntimeError: CUDA error: invalid configuration argument`.
Code to reproduce:
```python
import torch

h, w, d = 800, 1216, 12
n = 133
A = torch.randn(n, d).cuda()
B = torch.randn(h, w, d).cuda()
A.requires_grad = True
B.requires_grad = True
B = B.reshape(-1, d).contiguous()  # 972800 rows
dist = torch.cdist(A, B)
loss = dist.sum()
loss.backward()  # RuntimeError: CUDA error: invalid configuration argument
```
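The failure mode can be illustrated with a quick arithmetic check. This sketch assumes a hypothetical kernel mapping (one grid-y block per row of `B`, not necessarily PyTorch's actual launch configuration) to show why the reproduction above trips CUDA's grid-dimension limit:

```python
# Hypothetical illustration: if the backward kernel assigns one block along
# gridDim.y per row of B, the second grid dimension equals the row count of B,
# which must stay below CUDA's limit of 65535 for gridDim.y.
h, w = 800, 1216
r2 = h * w               # rows of B after reshape in the reproduction above
CUDA_MAX_GRID_Y = 65535  # hardware limit on the second grid dimension

print(r2)                    # 972800
print(r2 > CUDA_MAX_GRID_Y)  # True -> "invalid configuration argument"
```

Under this assumption the launch is rejected as soon as the flattened row count of `B` exceeds 65535, which matches the reported error.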
Thanks to tkerola for the bug report, the reproduction, and the suggested solution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31167
Differential Revision: D20035605
Pulled By: ngimel
fbshipit-source-id: ae28ba4b549ee07a8bd937bb1de2438dc24eaa17