Make 1D integer sorting work in parallel (#100081)
This patch reuses `radix_sort` from FBGEMM and makes `torch.(arg)sort` run in parallel for 1D tensors of integer dtype.
In GNN workloads we often use `torch.(arg)sort`, for example to compute the permutation from CSR to CSC storage format. Until now, sorting one-dimensional data was performed sequentially. Recently, the `radix_sort` implementation from FBGEMM was moved to common utilities and extended to cover negative numbers ([pytorch/FBGEMM#1672](https://github.com/pytorch/FBGEMM/pull/1672)). This gives us an opportunity to reuse `radix_sort` to accelerate 1D integer sorting in PyTorch.
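For readers unfamiliar with the technique, here is a minimal single-threaded sketch of an LSD radix argsort over signed integers. It is not the FBGEMM implementation: the function name `radix_argsort`, the 8-bit digit width, and the sign-bit flip used to order negative keys are assumptions made for illustration only.

```python
def radix_argsort(keys, bits=64, digit_bits=8):
    """Return the stable permutation that sorts signed `bits`-wide integers.

    Illustrative sketch only; names and parameters are hypothetical.
    """
    mask = (1 << digit_bits) - 1
    sign = 1 << (bits - 1)
    # Take the two's-complement bit pattern and flip the sign bit, so that
    # unsigned order on the biased keys matches signed order on the originals.
    biased = [(k & ((1 << bits) - 1)) ^ sign for k in keys]

    perm = list(range(len(keys)))
    for shift in range(0, bits, digit_bits):
        # Histogram of the current digit, shifted by one slot so that the
        # prefix sum below yields the starting offset of each bucket.
        counts = [0] * (mask + 2)
        for i in perm:
            counts[((biased[i] >> shift) & mask) + 1] += 1
        for d in range(1, mask + 2):
            counts[d] += counts[d - 1]

        # Stable scatter: elements keep their relative order within a bucket.
        out = [0] * len(perm)
        for i in perm:
            d = (biased[i] >> shift) & mask
            out[counts[d]] = i
            counts[d] += 1
        perm = out
    return perm


# Hypothetical CSR column indices, one per nonzero. A stable argsort of them
# gives the order in which the nonzeros appear in CSC storage.
col_indices = [2, 0, 1, 0, 2, 1]
print(radix_argsort(col_indices))  # [1, 3, 2, 5, 0, 4]
```

Each pass is a counting sort on one digit, which is what makes the algorithm amenable to parallelization: histograms can be built per thread and merged before the scatter step.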
Benchmark results, measured on a single-socket, 56-core machine:
Before (int64):
```
size: 64000, average run time (from 100 runs): 6.592ms
size: 128000, average run time (from 100 runs): 9.798ms
size: 256000, average run time (from 100 runs): 19.199ms
size: 512000, average run time (from 100 runs): 36.394ms
size: 1024000, average run time (from 100 runs): 70.371ms
size: 2048000, average run time (from 100 runs): 137.752ms
size: 4096000, average run time (from 100 runs): 287.257ms
```
After (int64):
```
size: 64000, average run time (from 100 runs): 1.553ms
size: 128000, average run time (from 100 runs): 1.853ms
size: 256000, average run time (from 100 runs): 2.873ms
size: 512000, average run time (from 100 runs): 4.323ms
size: 1024000, average run time (from 100 runs): 7.184ms
size: 2048000, average run time (from 100 runs): 14.250ms
size: 4096000, average run time (from 100 runs): 29.374ms
```
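The per-size speedups implied by the two tables above can be recomputed with a short script. The timings are copied verbatim from the benchmark output; the variable names are arbitrary.

```python
# Average run times in milliseconds, copied from the tables above.
before_ms = {
    64000: 6.592, 128000: 9.798, 256000: 19.199, 512000: 36.394,
    1024000: 70.371, 2048000: 137.752, 4096000: 287.257,
}
after_ms = {
    64000: 1.553, 128000: 1.853, 256000: 2.873, 512000: 4.323,
    1024000: 7.184, 2048000: 14.250, 4096000: 29.374,
}

# Speedup per tensor size, then the arithmetic mean over all sizes.
speedups = {n: before_ms[n] / after_ms[n] for n in before_ms}
average = sum(speedups.values()) / len(speedups)

for n, s in speedups.items():
    print(f"size: {n}, speedup: {s:.2f}x")
print(f"average speedup: {average:.1f}x")
```

The speedup grows from roughly 4x at the smallest size to nearly 10x for tensors of about one million elements and larger.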
Notes:
The average speedup across the measured tensor sizes is 7.7x.
For smaller types (e.g. int32/int16), an even higher speedup is observed, as fewer radix passes are required.
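As a rough illustration of why narrower types need fewer passes, the sketch below counts one pass per digit over the full width of the type. The 8-bit digit width is an assumption for illustration, and `radix_passes` is a hypothetical helper, not a function from PyTorch or FBGEMM.

```python
import math


def radix_passes(type_bits, digit_bits=8):
    """Number of LSD radix passes needed to cover every bit of the key."""
    return math.ceil(type_bits / digit_bits)


for name, bits in [("int64", 64), ("int32", 32), ("int16", 16)]:
    print(f"{name}: {radix_passes(bits)} passes")
```

Under these assumptions int32 needs half as many passes as int64, and int16 a quarter.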
Depends on #100236.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100081
Approved by: https://github.com/mingfeima, https://github.com/ngimel