Speed up torch.unique_consecutive() (#64835)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62690
Following the approach already used by `unique_consecutive_cpu_template`, this PR reimplements `_unique_dim_cpu_impl` for better performance.
In addition, because `unique_dim_consecutive_cpu` carries a large overhead, `unique_consecutive_cpu_template` is now called directly when the input is known to be a 1-D tensor.
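For readers unfamiliar with the operation, the 1-D case can be sketched in pure Python as a single pass that collapses runs of equal neighbours. This is only an illustration of the semantics, not the C++ code in this PR; the function name `unique_consecutive_1d` is hypothetical.

```python
from typing import List, Sequence, Tuple


def unique_consecutive_1d(
    values: Sequence[int],
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical reference: collapse runs of equal adjacent values.

    Returns (uniques, inverse, counts), mirroring the three outputs of
    torch.unique_consecutive(..., return_inverse=True, return_counts=True).
    """
    uniques: List[int] = []
    inverse: List[int] = []
    counts: List[int] = []
    for v in values:
        # Start a new run whenever the value differs from the previous one.
        if not uniques or v != uniques[-1]:
            uniques.append(v)
            counts.append(0)
        inverse.append(len(uniques) - 1)  # index of the run this element belongs to
        counts[-1] += 1
    return uniques, inverse, counts


print(unique_consecutive_1d([1, 1, 2, 2, 3, 1, 1, 2]))
# ([1, 2, 3, 1, 2], [0, 0, 1, 1, 2, 3, 3, 4], [2, 2, 1, 2, 1])
```

Each element is visited exactly once, so the work is linear in the input length.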
## Benchmark
### Script
```python
import torch
import time

torch.manual_seed(0)

# Case 1: sorted 1-D tensor with 500 distinct values.
t = torch.randint(500, (10000000, ))
t = torch.sort(t)[0]

# 1-D input with an explicit dim=0.
start = time.time()
uniques, inverse, counts = torch.unique_consecutive(t, dim=0, return_inverse=True, return_counts=True)
end = time.time()
print("torch.unique_consecutive(dim=0) time:", end - start)

# Same 1-D input without dim.
start = time.time()
uniques2, inverse2, counts2 = torch.unique_consecutive(t, return_inverse=True, return_counts=True)
end = time.time()
print("torch.unique_consecutive() time:", end - start)

# Case 2: 2-D tensor; torch.sort sorts each row along the last dimension.
t = torch.randint(500, (10000000, 2))
t = torch.sort(t)[0]

# 2-D input, consecutive-unique over rows.
start = time.time()
uniques, inverse, counts = torch.unique_consecutive(t, dim=0, return_inverse=True, return_counts=True)
end = time.time()
print("torch.unique_consecutive(dim=0) time:", end - start)

# 2-D input, consecutive-unique over columns.
start = time.time()
uniques, inverse, counts = torch.unique_consecutive(t, dim=1, return_inverse=True, return_counts=True)
end = time.time()
print("torch.unique_consecutive(dim=1) time:", end - start)
```
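The script above times each call once with `time.time()`. As a possible refinement (not what produced the numbers in this PR), a small helper could repeat each measurement and report the fastest run using `time.perf_counter()`. The helper name `best_of` and the stand-in workload are assumptions for illustration only.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls of `fn`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload so the sketch runs without PyTorch installed.
elapsed = best_of(lambda: sorted(range(100_000)), repeats=3)
print(f"best of 3: {elapsed:.6f} s")
```

To time the calls from the script, the lambda could be replaced with, for example, `lambda: torch.unique_consecutive(t, dim=0, return_inverse=True, return_counts=True)`.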
### Before
```
torch.unique_consecutive(dim=0) time: 78.64345622062683
torch.unique_consecutive() time: 0.029544353485107422
torch.unique_consecutive(dim=0) time: 91.49796152114868
torch.unique_consecutive(dim=1) time: 0.30872368812561035
```
### After
```
torch.unique_consecutive(dim=0) time: 0.08256125450134277
torch.unique_consecutive() time: 0.08162403106689453
torch.unique_consecutive(dim=0) time: 35.58408498764038
torch.unique_consecutive(dim=1) time: 1.6258199214935303
```
## System Information
```
Collecting environment information...
PyTorch version: 1.10.0a0+git7f1932e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-34-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.10.0a0+gitbe09195
[conda] Could not collect
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64835
Reviewed By: jbschlosser
Differential Revision: D30894906
Pulled By: ngimel
fbshipit-source-id: 42ab76d638391ce6c4e589d9c71bdf7579310ad9