Added cuSOLVER path for torch.linalg.eigh/eigvalsh (#53040)
Summary:
This PR adds the cuSOLVER-based path for `torch.linalg.eigh/eigvalsh`.
The device dispatching helper function was removed from native_functions.yaml and replaced with `DECLARE/DEFINE_DISPATCH`.
cuSOLVER is used if the CUDA version is >= 10.1.243. In addition, if the CUDA version is >= 11.1 (cuSOLVER version >= 11.0), the new 64-bit API is used.
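The version thresholds above can be read as a simple selection rule. The following is only an illustrative Python sketch of that rule, not the code in this PR: the function name `select_eigh_backend` and the returned labels are hypothetical, and falling back to MAGMA below 10.1.243 is an assumption.

```python
def select_eigh_backend(cuda_version):
    """Pick an eigensolver path from a (major, minor, patch) CUDA version tuple.

    Hypothetical helper: the name and the returned labels are illustrative only.
    """
    # CUDA >= 11.1 ships cuSOLVER >= 11.0, which provides the 64-bit API.
    if cuda_version >= (11, 1):
        return "cusolver_64bit"
    # CUDA >= 10.1.243 is the minimum for the cuSOLVER path.
    if cuda_version >= (10, 1, 243):
        return "cusolver_32bit"
    # Assumption: older toolkits keep using the MAGMA path.
    return "magma"


for version in [(10, 1, 105), (10, 1, 243), (11, 0, 3), (11, 1, 0)]:
    print(version, select_eigh_backend(version))
```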
I compared cuSOLVER's `syevd` vs MAGMA's `syevd`. cuSOLVER is faster than MAGMA for all matrix sizes.
I also compared cuSOLVER's `syevj` (Jacobi algorithm) with `syevd` (QR-based divide-and-conquer algorithm). Although `syevj` is said to be better than `syevd` for smaller matrices, in my tests this holds only for the float32 dtype and matrix sizes from 32x32 to 512x512.
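The measurement above could be encoded as a small heuristic. This is a sketch under assumptions: the helper `prefer_syevj` is hypothetical, dtypes are represented as plain strings, and this description does not state whether the PR applies such a rule.

```python
def prefer_syevj(dtype, n):
    """Return True when the benchmark above favoured syevj over syevd.

    Hypothetical helper: `dtype` is a plain string and `n` is the size of
    an n x n input matrix.
    """
    # syevj was faster only for float32 inputs between 32x32 and 512x512.
    return dtype == "float32" and 32 <= n <= 512


print(prefer_syevj("float32", 128))
print(prefer_syevj("float64", 128))
```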
For batched inputs, comparing a for loop of `syevd/syevj` calls to `syevjBatched` shows that the batched routine performs much better for batches of matrices up to 32x32. However, `syevjBatched` has bugs: sometimes it does not compute the result, leaving the eigenvectors as a unit diagonal (identity) matrix and the eigenvalues as the real diagonal of the input matrix. The output is the same with `cupy.cusolver.syevj`, so the problem is definitely on the cuSOLVER side. This bug is not present in the non-batched `syevj`.
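The failure pattern described above can be detected after the fact. The following pure-Python sketch is illustrative only: the helper `looks_uncomputed` is hypothetical, matrices are plain nested lists, and the tolerance is an arbitrary choice. It excludes diagonal inputs, for which identity eigenvectors are a correct answer.

```python
def looks_uncomputed(a, w, v, tol=1e-12):
    """Return True if (w, v) matches the failure pattern described above.

    a: n x n input matrix, w: n eigenvalues, v: n x n eigenvectors,
    all given as nested Python lists (hypothetical interface).
    """
    n = len(a)
    # Eigenvectors left as the identity matrix.
    v_is_identity = all(
        abs(v[i][j] - (1.0 if i == j else 0.0)) <= tol
        for i in range(n)
        for j in range(n)
    )
    # Eigenvalues left as the real part of the input's diagonal.
    w_is_diagonal = all(abs(w[i] - a[i][i].real) <= tol for i in range(n))
    # A diagonal input legitimately produces this output, so it is not flagged.
    a_is_diagonal = all(
        abs(a[i][j]) <= tol for i in range(n) for j in range(n) if i != j
    )
    return v_is_identity and w_is_diagonal and not a_is_diagonal


a = [[2.0, 1.0], [1.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(looks_uncomputed(a, [2.0, 2.0], identity))
```

A residual check such as comparing `A @ V` with `V * w` is a more general way to validate a result; the check above only targets the specific symptom reported here.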
The performance of the 64-bit `syevd` is the same as that of the 32-bit version.
Ref. https://github.com/pytorch/pytorch/issues/47953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53040
Reviewed By: H-Huang
Differential Revision: D27401218
Pulled By: mruberry
fbshipit-source-id: aef91eefb57ed73fef87774ff9a36d50779903f7