Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265
This PR adds cusolver to the PyTorch build and enables the use of cusolver/cublas library functions for GPU `torch.inverse` on certain tensor shapes.
Specifically, cusolver/cublas is used when
* the tensor is two-dimensional (a single matrix), or
* the tensor has more than two dimensions (batched) and `batch_size <= 2`, or
* MAGMA is not linked;
otherwise, the existing MAGMA implementation is still used.
https://github.com/pytorch/pytorch/blob/8c0949ae454b1d2c1b626a5ea19ba5ea6487d305/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu#L742-L752
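Roughly, the dispatch rule looks like the following sketch (the function and parameter names here are made up for illustration; the actual logic is in `BatchLinearAlgebra.cu` at the link above):

```cpp
#include <cstdint>

// Hypothetical helper mirroring the dispatch rule above: single matrices,
// batches of at most 2, or builds without MAGMA go to cusolver/cublas;
// everything else stays on the MAGMA path.
bool use_cusolver_or_cublas_for_inverse(int64_t ndim, int64_t batch_size, bool magma_linked) {
  if (!magma_linked) return true;   // no MAGMA available at all
  if (ndim == 2) return true;       // single (non-batched) matrix
  return batch_size <= 2;           // batched case with a tiny batch
}
```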
The reason for this is that, for tensors with large `batch_size`, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch the cusolver functions on multiple streams, which lets them run in parallel and can greatly improve performance. For `batch_size > 2`, however, the cusolver functions launched in parallel are still slightly slower than the current MAGMA implementation, so MAGMA is kept for those shapes.
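For reference, here is a minimal sketch of inverting a small batch with per-matrix cusolver calls on separate streams, as described above. This is an illustrative helper rather than the PR's actual code: it assumes float, column-major matrices, omits error checking, and expects `d_Ainv` to be pre-filled with identity matrices so that solving `A X = I` yields `X = A^{-1}`.

```cpp
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <vector>

// Illustrative only: invert `batch` n-by-n float matrices stored contiguously
// in d_A by running one cusolver LU factorization + solve per matrix, each on
// its own stream so the small factorizations can overlap.
// d_Ainv must be pre-filled with identity matrices; it receives the inverses.
void inverse_batched_via_streams(float* d_A, float* d_Ainv, int n, int batch) {
  std::vector<cudaStream_t> streams(batch);
  std::vector<cusolverDnHandle_t> handles(batch);
  std::vector<float*> works(batch);
  std::vector<int*> ipivs(batch), infos(batch);

  for (int i = 0; i < batch; ++i) {
    cudaStreamCreate(&streams[i]);
    cusolverDnCreate(&handles[i]);
    cusolverDnSetStream(handles[i], streams[i]);  // one stream per matrix
  }

  for (int i = 0; i < batch; ++i) {
    float* A    = d_A    + static_cast<size_t>(i) * n * n;
    float* Ainv = d_Ainv + static_cast<size_t>(i) * n * n;

    int lwork = 0;
    cusolverDnSgetrf_bufferSize(handles[i], n, n, A, n, &lwork);
    cudaMalloc(&works[i], sizeof(float) * lwork);
    cudaMalloc(&ipivs[i], sizeof(int) * n);
    cudaMalloc(&infos[i], sizeof(int));

    // LU-factorize A in place, then solve A * X = I for X = A^{-1}.
    cusolverDnSgetrf(handles[i], n, n, A, n, works[i], ipivs[i], infos[i]);
    cusolverDnSgetrs(handles[i], CUBLAS_OP_N, n, /*nrhs=*/n, A, n, ipivs[i], Ainv, n, infos[i]);
  }

  for (int i = 0; i < batch; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaFree(works[i]); cudaFree(ipivs[i]); cudaFree(infos[i]);
    cusolverDnDestroy(handles[i]);
    cudaStreamDestroy(streams[i]);
  }
}
```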
On CUDA 9.2, some numerical issues were detected, so the cusolver implementation is not used there. It is also not used on platforms other than NVIDIA CUDA.
https://github.com/pytorch/pytorch/blob/060769feaf02db56ac79e0c73dab1105828ece69/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h#L10-L13
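The guard is roughly of this form (a hedged sketch; see the linked header for the exact macro names):

```cpp
// Sketch of the compile-time guard: the cusolver path is only built for
// NVIDIA CUDA >= 10.0, which skips both non-CUDA platforms and CUDA 9.2.
#if defined(CUDART_VERSION) && CUDART_VERSION >= 10000
#define USE_CUSOLVER
#endif
```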
Note that a new heuristic is used before the cusolver/cublas calls:
https://github.com/pytorch/pytorch/blob/8c0949ae454b1d2c1b626a5ea19ba5ea6487d305/aten/src/ATen/native/cuda/MiscUtils.h#L113-L121
where `use_loop_launch = true` means launching single-matrix cusolver functions in parallel, and `use_loop_launch = false` means using the batched cublas functions. When MAGMA is enabled (only `batch_size <= 2` is dispatched to cusolver/cublas), the heuristic always returns `true`, and the cusolver calls are faster than the corresponding small-batch MAGMA calls. When MAGMA is disabled, this enables `torch.inverse` on GPU, which previously failed for all shapes (though large-batch cublas performance may not be as good as MAGMA's).
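A hypothetical sketch of what such a heuristic looks like (the cutoffs below are illustrative only; the real values live in `MiscUtils.h` at the link above):

```cpp
// Illustrative only: decide between launching one cusolver call per matrix
// on separate streams (`true`) and calling the cublas *Batched kernels
// (`false`). Small batches favor the per-matrix cusolver path.
static inline bool use_loop_launch(int batch_size, int matrix_size) {
  // Example cutoffs, not the PR's actual numbers: batches of size <= 2
  // always take the per-matrix path, matching the prose above.
  return batch_size <= 2 || matrix_size >= 64;
}
```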
Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas (see the sketch after this list)
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests
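For the batched cublas path checked off above, here is a minimal sketch of the `getrfBatched`/`getriBatched` sequence (a hypothetical helper, float only, no error handling; the actual code in the PR handles more dtypes and error reporting):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Illustrative only: invert `batch` n-by-n float matrices with one
// cublas LU factorization call and one batched-inverse call.
void inverse_batched_via_cublas(const float* d_A, float* d_Ainv, int n, int batch) {
  cublasHandle_t handle;
  cublasCreate(&handle);

  // getrfBatched factorizes in place, so work on a copy of the input.
  float* d_Awork;
  cudaMalloc(&d_Awork, sizeof(float) * static_cast<size_t>(n) * n * batch);
  cudaMemcpy(d_Awork, d_A, sizeof(float) * static_cast<size_t>(n) * n * batch,
             cudaMemcpyDeviceToDevice);

  // The batched cublas routines take device arrays of device pointers.
  std::vector<float*> h_Aptrs(batch), h_Cptrs(batch);
  for (int i = 0; i < batch; ++i) {
    h_Aptrs[i] = d_Awork + static_cast<size_t>(i) * n * n;
    h_Cptrs[i] = d_Ainv  + static_cast<size_t>(i) * n * n;
  }
  float **d_Aptrs, **d_Cptrs;
  cudaMalloc(&d_Aptrs, sizeof(float*) * batch);
  cudaMalloc(&d_Cptrs, sizeof(float*) * batch);
  cudaMemcpy(d_Aptrs, h_Aptrs.data(), sizeof(float*) * batch, cudaMemcpyHostToDevice);
  cudaMemcpy(d_Cptrs, h_Cptrs.data(), sizeof(float*) * batch, cudaMemcpyHostToDevice);

  int* d_pivots; int* d_infos;
  cudaMalloc(&d_pivots, sizeof(int) * n * batch);
  cudaMalloc(&d_infos, sizeof(int) * batch);

  // LU-factorize all matrices, then compute all inverses out of place.
  cublasSgetrfBatched(handle, n, d_Aptrs, n, d_pivots, d_infos, batch);
  cublasSgetriBatched(handle, n, d_Aptrs, n, d_pivots, d_Cptrs, n, d_infos, batch);

  cudaDeviceSynchronize();
  cudaFree(d_pivots); cudaFree(d_infos);
  cudaFree(d_Aptrs);  cudaFree(d_Cptrs);
  cudaFree(d_Awork);
  cublasDestroy(handle);
}
```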
Next step:
If cusolver doesn't cause any problems in the PyTorch build, and no major performance regressions are reported after this PR is merged, I will start porting other linear algebra functions to cusolver/cublas to improve performance.
<details>
<summary> benchmark 73499c6 </summary>
benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb
shape meaning (the bracketed prefix is the batch shape, the number is the matrix size `n`):
* `[] 2 torch.float32` → `torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32` → `torch.randn(2, 4, 4, dtype=torch.float32)`
| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 | 0.095 | 7.534 | 0.129 |
| [] 4 torch.float32 | 0.009 | 7.522 | 0.129 |
| [] 8 torch.float32 | 0.011 | 7.647 | 0.138 |
| [] 16 torch.float32 | 0.075 | 7.582 | 0.135 |
| [] 32 torch.float32 | 0.073 | 7.573 | 0.191 |
| [] 64 torch.float32 | 0.134 | 7.694 | 0.288 |
| [] 128 torch.float32 | 0.398 | 8.073 | 0.491 |
| [] 256 torch.float32 | 1.054 | 11.860 | 1.074 |
| [] 512 torch.float32 | 5.218 | 14.130 | 2.582 |
| [] 1024 torch.float32 | 19.010 | 18.780 | 6.936 |
| [1] 2 torch.float32 | 0.009 | 0.113 | 0.128 ***regressed |
| [1] 4 torch.float32 | 0.009 | 0.113 | 0.131 ***regressed |
| [1] 8 torch.float32 | 0.011 | 0.116 | 0.129 ***regressed |
| [1] 16 torch.float32 | 0.015 | 0.122 | 0.135 ***regressed |
| [1] 32 torch.float32 | 0.032 | 0.177 | 0.178 ***regressed |
| [1] 64 torch.float32 | 0.070 | 0.420 | 0.281 |
| [1] 128 torch.float32 | 0.328 | 0.816 | 0.490 |
| [1] 256 torch.float32 | 1.125 | 1.690 | 1.084 |
| [1] 512 torch.float32 | 4.344 | 4.305 | 2.576 |
| [1] 1024 torch.float32 | 16.510 | 16.340 | 6.928 |
| [2] 2 torch.float32 | 0.009 | 0.113 | 0.186 ***regressed |
| [2] 4 torch.float32 | 0.011 | 0.115 | 0.184 ***regressed |
| [2] 8 torch.float32 | 0.012 | 0.114 | 0.184 ***regressed |
| [2] 16 torch.float32 | 0.019 | 0.119 | 0.173 ***regressed |
| [2] 32 torch.float32 | 0.050 | 0.170 | 0.240 ***regressed |
| [2] 64 torch.float32 | 0.120 | 0.429 | 0.375 |
| [2] 128 torch.float32 | 0.576 | 0.830 | 0.675 |
| [2] 256 torch.float32 | 2.021 | 1.748 | 1.451 |
| [2] 512 torch.float32 | 9.070 | 4.749 | 3.539 |
| [2] 1024 torch.float32 | 33.655 | 18.240 | 12.220 |
| [4] 2 torch.float32 | 0.009 | 0.112 | 0.318 ***regressed |
| [4] 4 torch.float32 | 0.010 | 0.115 | 0.319 ***regressed |
| [4] 8 torch.float32 | 0.013 | 0.115 | 0.320 ***regressed |
| [4] 16 torch.float32 | 0.027 | 0.120 | 0.331 ***regressed |
| [4] 32 torch.float32 | 0.085 | 0.173 | 0.385 ***regressed |
| [4] 64 torch.float32 | 0.221 | 0.431 | 0.646 ***regressed |
| [4] 128 torch.float32 | 1.102 | 0.834 | 1.055 ***regressed |
| [4] 256 torch.float32 | 4.042 | 1.811 | 2.054 ***regressed |
| [4] 512 torch.float32 | 18.390 | 4.884 | 5.087 ***regressed |
| [4] 1024 torch.float32 | 69.025 | 19.840 | 20.000 ***regressed |
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403
Reviewed By: ailzhang, mruberry
Differential Revision: D23717984
Pulled By: ngimel
fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b