Migrate dist from TH to ATen(CPU, CUDA) (#29714)
Summary:
[https://github.com/pytorch/pytorch/issues/24691](https://github.com/pytorch/pytorch/issues/24691)
[https://github.com/pytorch/pytorch/issues/24551](https://github.com/pytorch/pytorch/issues/24551)
Benchmark:
**Speed**
```python
import time, sys
import torch
import math

inf = math.inf

torch.manual_seed(0)
devices = ["cpu", "cuda"]
ps = [0, 1, 2, 3, 4, inf, -inf]

# Warm up
for device in devices:
    for n in [1, 10, 100, 1000]:
        x = torch.randn(100, n, requires_grad=False, device=device)
        y = torch.randn(100, n, requires_grad=False, device=device)
        for i in range(1000):
            for p in ps:
                dist_xy = torch.dist(x, y, p)

# Timed runs: average over 10000 iterations and all values of p
for device in devices:
    print('On {}'.format(device))
    for n in [1, 10, 100, 1000]:
        total_time = 0
        x = torch.randn(100, n, requires_grad=False, device=device)
        y = torch.randn(100, n, requires_grad=False, device=device)
        for i in range(10000):
            for p in ps:
                t1 = time.time()
                dist_xy = torch.dist(x, y, p)
                t2 = time.time()
                total_time += (t2 - t1)
        average_time = total_time / 10000 / len(ps) * 1000
        print("input size(100, %d) average time is %.8f (ms)." % (n, average_time))
```
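Note on CUDA timing: kernels are generally launched asynchronously, so wall-clock time taken around a single call may partly reflect launch overhead rather than kernel time. A minimal sketch of a timing helper that accepts an optional synchronization hook is shown below. `average_ms` and its parameters are hypothetical names introduced here and are not part of the benchmark above; passing `torch.cuda.synchronize` as `sync` is one possible choice.

```python
import time


def average_ms(fn, iters=1000, sync=None):
    """Return the average wall-clock time of fn() in milliseconds.

    sync is an optional zero-argument callable (an assumption of this
    sketch) that blocks until pending device work has finished.
    """
    if sync is not None:
        sync()  # drain work queued before the timed region
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to complete
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0
```

With the tensors from the benchmark, a call could look like `average_ms(lambda: torch.dist(x, y, 2), sync=torch.cuda.synchronize)`.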
Output
Before:
```shell
On cpu
input size(100, 1) average time is 0.0079491 (ms).
input size(100, 10) average time is 0.0364167 (ms).
input size(100, 100) average time is 0.3120752 (ms).
input size(100, 1000) average time is 3.0605820 (ms).
On cuda
input size(100, 1) average time is 0.04745627 (ms).
input size(100, 10) average time is 0.04919453 (ms).
input size(100, 100) average time is 0.06601572 (ms).
input size(100, 1000) average time is 0.07849015 (ms).
```
After:
```shell
On cpu
input size(100, 1) average time is 0.0099936 (ms).
input size(100, 10) average time is 0.0340414 (ms).
input size(100, 100) average time is 0.2793379 (ms).
input size(100, 1000) average time is 0.7858076 (ms).
On cuda
input size(100, 1) average time is 0.04410237 (ms).
input size(100, 10) average time is 0.03326339 (ms).
input size(100, 100) average time is 0.03314828 (ms).
input size(100, 1000) average time is 0.03990038 (ms).
```
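To read the two tables side by side, the before/after ratio can be computed per input size. The sketch below only restates the timings listed above; `speedups` is a helper name introduced here for illustration. On these numbers the largest gain is on CPU at input size (100, 1000), roughly 3.9x, while CPU at (100, 1) is slightly slower.

```python
# Average times in ms, copied from the Before/After output above.
sizes = [1, 10, 100, 1000]
before = {
    "cpu": [0.0079491, 0.0364167, 0.3120752, 3.0605820],
    "cuda": [0.04745627, 0.04919453, 0.06601572, 0.07849015],
}
after = {
    "cpu": [0.0099936, 0.0340414, 0.2793379, 0.7858076],
    "cuda": [0.04410237, 0.03326339, 0.03314828, 0.03990038],
}


def speedups(before, after):
    """Return before/after ratios per device; values above 1 mean faster."""
    return {
        device: [b / a for b, a in zip(before[device], after[device])]
        for device in before
    }


ratios = speedups(before, after)
for device, values in ratios.items():
    for n, ratio in zip(sizes, values):
        print("{}: input size (100, {}) speedup {:.2f}x".format(device, n, ratio))
```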
**Precision**
```python
# Reuses torch, devices, and ps from the speed benchmark above.
for device in devices:
    torch.manual_seed(0)
    print('On {}'.format(device))
    for n in [1, 10, 100, 1000]:
        x = torch.randn(100, n, requires_grad=False).to(device)
        y = torch.randn(100, n, requires_grad=False).to(device)
        for p in ps:
            # Compare the float32 result against a float64 reference
            dist_xy_float = torch.dist(x, y, p)
            dist_xy_double = torch.dist(x.double(), y.double(), p)
            difference = torch.abs(dist_xy_double - dist_xy_float)
            print('input size (100, {}), p: {}, float: {}, double: {}, difference: {}'.format(n, p, dist_xy_float, dist_xy_double, difference))
```
Partial [output](https://gist.github.com/rivergold/dd95014dc7f163b22f72699d1134cdd2), for input size (100, 100):
Before:
```shell
On cpu
input size (100, 100), p: 0, float: 10000.0, double: 10000.0, difference: 0.0
input size (100, 100), p: 1, float: 11474.1806640625, double: 11474.185433543797, difference: 0.00476948129653465
input size (100, 100), p: 2, float: 143.50729370117188, double: 143.5073391487937, difference: 4.5447621829453055e-05
input size (100, 100), p: 3, float: 36.045475006103516, double: 36.04550275212738, difference: 2.774602386779179e-05
input size (100, 100), p: 4, float: 18.796083450317383, double: 18.79609807865317, difference: 1.4628335787136848e-05
input size (100, 100), p: inf, float: 5.540258407592773, double: 5.5402586460113525, difference: 2.384185791015625e-07
input size (100, 100), p: -inf, float: 3.4868717193603516e-06, double: 3.4868717193603516e-06, difference: 0.0
On cuda
input size (100, 100), p: 0, float: 10000.0, double: 10000.0, difference: 0.0
input size (100, 100), p: 1, float: 11474.1865234375, double: 11474.185433543797, difference: 0.00108989370346535
input size (100, 100), p: 2, float: 143.50733947753906, double: 143.5073391487933, difference: 3.2874575595087663e-07
input size (100, 100), p: 3, float: 36.04550552368164, double: 36.045502752127405, difference: 2.7715542358919265e-06
input size (100, 100), p: 4, float: 18.796098709106445, double: 18.796098078653177, difference: 6.304532682577246e-07
input size (100, 100), p: inf, float: 5.540258407592773, double: 5.5402586460113525, difference: 2.384185791015625e-07
input size (100, 100), p: -inf, float: 3.4868717193603516e-06, double: 3.4868717193603516e-06, difference: 0.0
```
After:
```shell
On cpu
input size (100, 100), p: 0, float: 10000.0, double: 10000.0, difference: 0.0
input size (100, 100), p: 1, float: 11474.1806640625, double: 11474.185433543797, difference: 0.00476948129653465
input size (100, 100), p: 2, float: 143.50729370117188, double: 143.5073391487937, difference: 4.5447621829453055e-05
input size (100, 100), p: 3, float: 36.045475006103516, double: 36.04550275212738, difference: 2.774602386779179e-05
input size (100, 100), p: 4, float: 18.796083450317383, double: 18.79609807865317, difference: 1.4628335787136848e-05
input size (100, 100), p: inf, float: 5.540258407592773, double: 5.5402586460113525, difference: 2.384185791015625e-07
input size (100, 100), p: -inf, float: 3.4868717193603516e-06, double: 3.4868717193603516e-06, difference: 0.0
On cuda
input size (100, 100), p: 0, float: 10000.0, double: 10000.0, difference: 0.0
input size (100, 100), p: 1, float: 11474.185546875, double: 11474.185433543797, difference: 0.00011333120346534997
input size (100, 100), p: 2, float: 143.50733947753906, double: 143.5073391487933, difference: 3.2874575595087663e-07
input size (100, 100), p: 3, float: 36.04550552368164, double: 36.045502752127405, difference: 2.7715542358919265e-06
input size (100, 100), p: 4, float: 18.796096801757812, double: 18.796098078653177, difference: 1.2768953645547754e-06
input size (100, 100), p: inf, float: 5.540258407592773, double: 5.5402586460113525, difference: 2.384185791015625e-07
input size (100, 100), p: -inf, float: 3.4868717193603516e-06, double: 3.4868717193603516e-06, difference: 0.0
```
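For reading the values above: `torch.dist(x, y, p)` is the p-norm of `x - y`, which is why p = 0 reports 10000.0 (the count of nonzero differences in a 100 x 100 input) and p = inf / -inf report the largest / smallest absolute difference. A plain-Python sketch of these semantics follows; `reference_dist` is a hypothetical helper written for illustration and is not the ATen kernel.

```python
import math


def reference_dist(xs, ys, p):
    """p-norm of the elementwise difference of two equal-length sequences."""
    diffs = [abs(a - b) for a, b in zip(xs, ys)]
    if p == 0:
        # "0-norm": number of nonzero differences
        return float(sum(1 for d in diffs if d != 0))
    if p == math.inf:
        return max(diffs)
    if p == -math.inf:
        return min(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)
```

For example, `reference_dist([1.0, 2.0, 3.0], [1.0, 0.0, 7.0], 1)` sums the absolute differences 0, 2, and 4.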
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29714
Differential Revision: D19769518
Pulled By: albanD
fbshipit-source-id: 69b79b64f1f190b410efe884662b6601e903eccf