Sparse softmax support (CUDA) (#42307)
Summary:
This PR implements softmax support for sparse tensors.
Resolves gh-23651 for CUDA.
- [x] sparse softmax
  - [x] CUDA C++ implementation
  - [x] unittests
  - [x] update softmax documentation
  - [x] autograd support
- [x] sparse log_softmax
  - [x] CUDA C++ implementation
  - [x] unittests
  - [x] update log_softmax documentation
  - [x] autograd support
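For context, a minimal usage sketch of the operation this PR adds (the indices, values, and shape below are illustrative): sparse softmax normalizes only the *specified* entries along a dimension, treating unspecified entries as `-inf`, so the result stays sparse.

```python
import torch

# Small 2x3 sparse COO tensor with three specified entries
indices = torch.tensor([[0, 0, 1], [0, 2, 1]])
values = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)
s = torch.sparse_coo_tensor(indices, values, (2, 3))

# Softmax over dim=1: each row's specified values sum to 1,
# unspecified entries remain zero in the dense view
out = torch.sparse.softmax(s, dim=1)
print(out.to_dense())
```

With autograd support, `values.requires_grad_()` before constructing the tensor lets gradients flow through `torch.sparse.softmax` like the dense version.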
Here are some benchmark results (the script is [here](https://gist.github.com/aocsa/fbc1827b3e49901512a33ba96092cbc1)) for `torch.sparse.softmax` and `torch.softmax` on CPU and CUDA. Values are float64; each timing is repeated 1000 times:
| size | density | sparse CUDA | sparse CPU |
|--------------|---------|-------------|------------|
| (32, 10000) | 0.01 | 380.2 | 687.5 |
| (32, 10000) | 0.05 | 404.3 | 2357.9 |
| (32, 10000) | 0.1 | 405.9 | 3677.2 |
| (512, 10000) | 0.01 | 438.0 | 5443.4 |
| (512, 10000) | 0.05 | 888.1 | 24485.0 |
| (512, 10000) | 0.1 | 1921.3 | 45340.5 |

| size | density | dense CUDA | dense CPU |
|--------------|---------|-------------|------------|
| (32, 10000) | 0.01 | 23.6 | 1943.2 |
| (32, 10000) | 0.05 | 23.6 | 1954.0 |
| (32, 10000) | 0.1 | 23.5 | 1950.0 |
| (512, 10000) | 0.01 | 639.3 | 39797.9 |
| (512, 10000) | 0.05 | 640.3 | 39374.4 |
| (512, 10000) | 0.1 | 639.6 | 39192.3 |
Times are in microseconds (us).
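A minimal CPU timing sketch in the spirit of the linked gist (the helper name, sizes, and repeat count here are illustrative, not the exact benchmark script):

```python
import timeit
import torch

def make_sparse(shape, density):
    # Random sparse COO tensor with roughly `density` fraction of non-zeros
    mask = torch.rand(shape) < density
    dense = torch.randn(shape, dtype=torch.float64) * mask
    return dense.to_sparse()

x = make_sparse((32, 10000), 0.01)
t = timeit.timeit(lambda: torch.sparse.softmax(x, dim=1), number=100)
print(f"sparse CPU softmax: {t / 100 * 1e6:.1f} us per call")
```

The same loop with `x.to_dense()` and `torch.softmax` gives the dense baseline; swapping in a CUDA device (with synchronization around the timed region) reproduces the GPU columns.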
Quick note: I updated the performance test again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42307
Reviewed By: ngimel
Differential Revision: D23774427
Pulled By: mruberry
fbshipit-source-id: bfabf726075b39dde544c10249f27ae1871f82c7