c537acf4 - Make 1D integer sorting work in parallel (#100081)

This patch reuses `radix_sort` from fbgemm and makes `torch.(arg)sort` work in parallel for tensors filled with integers.

In GNN workloads we often use `torch.(arg)sort`, for example, to calculate the permutation from CSR to CSC storage format (see the sketch below). Until now, sorting one-dimensional data was performed sequentially. Recently, the `radix_sort` implementation from FBGEMM was moved to common utilities and enhanced to cover negative numbers ([pytorch/FBGEMM#1672](https://github.com/pytorch/FBGEMM/pull/1672)). This gives us an opportunity to reuse `radix_sort` to accelerate 1D integer sorting in PyTorch.

Benchmark results, measured on a single-socket, 56-core machine:

Before (int64):
```
size: 64000, average run time (from 100 runs): 6.592ms
size: 128000, average run time (from 100 runs): 9.798ms
size: 256000, average run time (from 100 runs): 19.199ms
size: 512000, average run time (from 100 runs): 36.394ms
size: 1024000, average run time (from 100 runs): 70.371ms
size: 2048000, average run time (from 100 runs): 137.752ms
size: 4096000, average run time (from 100 runs): 287.257ms
```

After (int64):
```
size: 64000, average run time (from 100 runs): 1.553ms
size: 128000, average run time (from 100 runs): 1.853ms
size: 256000, average run time (from 100 runs): 2.873ms
size: 512000, average run time (from 100 runs): 4.323ms
size: 1024000, average run time (from 100 runs): 7.184ms
size: 2048000, average run time (from 100 runs): 14.250ms
size: 4096000, average run time (from 100 runs): 29.374ms
```

Notes: the average speedup across the measured tensor sizes is 7.7x. For smaller integer types (e.g. int32/int16) an even higher speedup is observed, since radix sort needs fewer passes over narrower keys.

Depends on #100236.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100081
Approved by: https://github.com/mingfeima, https://github.com/ngimel
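As a minimal sketch of the GNN use case mentioned above: the CSR-to-CSC permutation can be obtained by a stable `torch.argsort` over the integer column indices, which is exactly the 1D integer sort this patch parallelizes. The tensor values here are illustrative, not from the PR.

```python
import torch

# A small sparse graph in CSR form (3 rows, 3 cols, 5 non-zeros).
crow_indices = torch.tensor([0, 2, 4, 5])    # row pointers
col_indices = torch.tensor([0, 2, 1, 2, 0])  # column index per non-zero
values = torch.arange(5)                     # payload per non-zero

# A stable argsort of the column indices yields the CSR -> CSC
# permutation; stability preserves row order within each column.
perm = torch.argsort(col_indices, stable=True)
csc_values = values[perm]
```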
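The benchmark harness used for the numbers above is not shown in the PR description; the following is a hypothetical sketch of how such timings could be reproduced, using the quoted sizes and run count.

```python
import time
import torch

def bench(size, runs=100):
    # 1D int64 tensor with random keys, including negative values.
    x = torch.randint(-2**62, 2**62, (size,), dtype=torch.int64)
    torch.sort(x)  # warm-up to exclude one-time dispatch costs
    start = time.perf_counter()
    for _ in range(runs):
        torch.sort(x)
    elapsed_ms = (time.perf_counter() - start) / runs * 1000
    print(f"size: {size}, average run time (from {runs} runs): {elapsed_ms:.3f}ms")

for size in (64000, 128000, 256000, 512000, 1024000, 2048000, 4096000):
    bench(size)
```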