d75c4027 - Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265

This PR adds cusolver to the PyTorch build and enables the cusolver/cublas library functions for GPU `torch.inverse` on certain tensor shapes. Specifically, cusolver/cublas is used when

* the tensor is two-dimensional (single batch), or
* the tensor has more than two dimensions (multiple batches) and `batch_size <= 2`, or
* MAGMA is not linked.

In all other cases, the current MAGMA implementation is still used (a sketch of this dispatch rule is given after this summary).

https://github.com/pytorch/pytorch/blob/8c0949ae454b1d2c1b626a5ea19ba5ea6487d305/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu#L742-L752

The reason is that `cublasXgetrfBatched` and `cublasXgetriBatched` do not perform very well on tensors with a large batch_size. For `batch_size > 1`, the cusolver functions are launched in multiple streams, which lets them run in parallel and can greatly increase performance. When `batch_size > 2`, however, the parallel cusolver launches are slightly slower than the current MAGMA implementation, so the MAGMA implementation is kept for those shapes.

On CUDA 9.2, some numerical issues were detected, so the cusolver implementation is not used there. It is also not used on platforms other than NVIDIA CUDA.

https://github.com/pytorch/pytorch/blob/060769feaf02db56ac79e0c73dab1105828ece69/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h#L10-L13

Note that a new heuristic is applied before the cusolver/cublas calls:

https://github.com/pytorch/pytorch/blob/8c0949ae454b1d2c1b626a5ea19ba5ea6487d305/aten/src/ATen/native/cuda/MiscUtils.h#L113-L121

where `use_loop_launch = true` means single-batch cusolver functions are launched in parallel, and `use_loop_launch = false` means the cublas_X_batched functions are used. When MAGMA is enabled (only `batch_size <= 2` is dispatched to cusolver/cublas), the heuristic always returns `true`, and the cusolver calls are faster than the corresponding small-batch_size MAGMA calls. When MAGMA is disabled, this PR adds `torch.inverse` functionality that was previously disabled for all shapes (though large-batch_size cublas performance may not be as good as MAGMA).

Checklist:

- [X] Add benchmark: cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to the `USE_MAGMA` define guard
- [x] Workaround for non-CUDA platforms
- [x] Workaround for CUDA 9.2
- [x] Add zero-size check
- [x] Add tests

Next step: if cusolver doesn't cause any problems in the PyTorch build, and no major performance regressions are reported after this PR is merged, I will start porting other cusolver/cublas linear algebra functions to improve performance.
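The following is a minimal Python sketch of the dispatch rule described above. The real dispatch lives in the C++/CUDA code linked above; the function name `inverse_backend` and the `magma_linked` flag here are illustrative assumptions, not PyTorch API.

```python
import torch

def inverse_backend(t: torch.Tensor, magma_linked: bool = True) -> str:
    """Illustrative only: which backend the rule above would pick for `t`."""
    if t.dim() == 2:
        return "cusolver/cublas"          # single matrix, no batch dims
    # batch_size is the product of all leading (batch) dimensions
    batch_size = t.numel() // (t.shape[-2] * t.shape[-1])
    if not magma_linked or batch_size <= 2:
        return "cusolver/cublas"          # small batches, or MAGMA not linked
    return "MAGMA"                        # large batches stay on MAGMA

print(inverse_backend(torch.randn(4, 4)))        # cusolver/cublas
print(inverse_backend(torch.randn(2, 8, 8)))     # cusolver/cublas (batch_size == 2)
print(inverse_backend(torch.randn(16, 8, 8)))    # MAGMA
```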
<details>
<summary>benchmark 73499c6</summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:

* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 | 0.095 | 7.534 | 0.129 |
| [] 4 torch.float32 | 0.009 | 7.522 | 0.129 |
| [] 8 torch.float32 | 0.011 | 7.647 | 0.138 |
| [] 16 torch.float32 | 0.075 | 7.582 | 0.135 |
| [] 32 torch.float32 | 0.073 | 7.573 | 0.191 |
| [] 64 torch.float32 | 0.134 | 7.694 | 0.288 |
| [] 128 torch.float32 | 0.398 | 8.073 | 0.491 |
| [] 256 torch.float32 | 1.054 | 11.860 | 1.074 |
| [] 512 torch.float32 | 5.218 | 14.130 | 2.582 |
| [] 1024 torch.float32 | 19.010 | 18.780 | 6.936 |
| [1] 2 torch.float32 | 0.009 | 0.113 | 0.128 ***regressed |
| [1] 4 torch.float32 | 0.009 | 0.113 | 0.131 ***regressed |
| [1] 8 torch.float32 | 0.011 | 0.116 | 0.129 ***regressed |
| [1] 16 torch.float32 | 0.015 | 0.122 | 0.135 ***regressed |
| [1] 32 torch.float32 | 0.032 | 0.177 | 0.178 ***regressed |
| [1] 64 torch.float32 | 0.070 | 0.420 | 0.281 |
| [1] 128 torch.float32 | 0.328 | 0.816 | 0.490 |
| [1] 256 torch.float32 | 1.125 | 1.690 | 1.084 |
| [1] 512 torch.float32 | 4.344 | 4.305 | 2.576 |
| [1] 1024 torch.float32 | 16.510 | 16.340 | 6.928 |
| [2] 2 torch.float32 | 0.009 | 0.113 | 0.186 ***regressed |
| [2] 4 torch.float32 | 0.011 | 0.115 | 0.184 ***regressed |
| [2] 8 torch.float32 | 0.012 | 0.114 | 0.184 ***regressed |
| [2] 16 torch.float32 | 0.019 | 0.119 | 0.173 ***regressed |
| [2] 32 torch.float32 | 0.050 | 0.170 | 0.240 ***regressed |
| [2] 64 torch.float32 | 0.120 | 0.429 | 0.375 |
| [2] 128 torch.float32 | 0.576 | 0.830 | 0.675 |
| [2] 256 torch.float32 | 2.021 | 1.748 | 1.451 |
| [2] 512 torch.float32 | 9.070 | 4.749 | 3.539 |
| [2] 1024 torch.float32 | 33.655 | 18.240 | 12.220 |
| [4] 2 torch.float32 | 0.009 | 0.112 | 0.318 ***regressed |
| [4] 4 torch.float32 | 0.010 | 0.115 | 0.319 ***regressed |
| [4] 8 torch.float32 | 0.013 | 0.115 | 0.320 ***regressed |
| [4] 16 torch.float32 | 0.027 | 0.120 | 0.331 ***regressed |
| [4] 32 torch.float32 | 0.085 | 0.173 | 0.385 ***regressed |
| [4] 64 torch.float32 | 0.221 | 0.431 | 0.646 ***regressed |
| [4] 128 torch.float32 | 1.102 | 0.834 | 1.055 ***regressed |
| [4] 256 torch.float32 | 4.042 | 1.811 | 2.054 ***regressed |
| [4] 512 torch.float32 | 18.390 | 4.884 | 5.087 ***regressed |
| [4] 1024 torch.float32 | 69.025 | 19.840 | 20.000 ***regressed |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403

Reviewed By: ailzhang, mruberry

Differential Revision: D23717984

Pulled By: ngimel

fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
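As a reading aid for the shape notation in the benchmark table above, here is a minimal sketch of how such inputs are constructed. The helper `make_input` is hypothetical (not part of the linked benchmark notebook), and no timings are reproduced here.

```python
import torch

def make_input(batch, n, dtype=torch.float32):
    # "[] n"  -> a single n x n matrix; "[b] n" -> a batch of b matrices of size n x n
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return torch.randn(*batch, n, n, dtype=dtype, device=device)

a = make_input([], 512)      # corresponds to the "[] 512 torch.float32" row
b = make_input([2], 512)     # corresponds to the "[2] 512 torch.float32" row
torch.inverse(a)
torch.inverse(b)
```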