Use cusolver potrs as the backend of cholesky_inverse for batch_size == 1 on CUDA (#54676)
Summary:
This PR uses cusolver `potrs` as the backend of `cholesky_inverse` for batch_size == 1 on CUDA.
Cusolver `potri` is **not** used, because
- it only writes the upper or lower triangle of the result. Although the other half is zero, we would still need extra kernels to fill in the full Hermitian matrix
- it's no faster than cusolver potrs in most cases
- it doesn't have a batched version or 64-bit version
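
For illustration, the idea behind using `potrs` here is that solving `A X = I` with an existing Cholesky factor yields the full inverse directly, with both triangles filled. The pure-Python sketch below is not the code in this PR. It only mimics what a `potrs`-style solve does (forward then back substitution) on a small matrix, and the function name `cholesky_inverse_via_solve` is hypothetical.

```python
import math


def cholesky_inverse_via_solve(L):
    """Invert A = L @ L^T by solving A X = I, given lower-triangular L.

    Illustrative only: mirrors a potrs-style solve with the identity
    as the right-hand side, one column at a time.
    """
    n = len(L)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        # Right-hand side: column `col` of the identity matrix.
        b = [1.0 if i == col else 0.0 for i in range(n)]
        # Forward substitution: solve L y = b.
        y = [0.0] * n
        for i in range(n):
            s = sum(L[i][k] * y[k] for k in range(i))
            y[i] = (b[i] - s) / L[i][i]
        # Back substitution: solve L^T x = y (L^T[i][k] == L[k][i]).
        x = [0.0] * n
        for i in reversed(range(n)):
            s = sum(L[k][i] * x[k] for k in range(i + 1, n))
            x[i] = (y[i] - s) / L[i][i]
        for i in range(n):
            inv[i][col] = x[i]
    return inv


# A = [[4, 2], [2, 3]] has the lower Cholesky factor below.
L = [[2.0, 0.0], [1.0, math.sqrt(2.0)]]
A_inv = cholesky_inverse_via_solve(L)
# A_inv is approximately [[0.375, -0.25], [-0.25, 0.5]], with both
# triangles populated, so no extra symmetrization step is needed.
```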
`cholesky_inverse` dispatch heuristics:
- If magma is not installed, or batch_size is 1, dispatch to cusolver: `cusolverDnXpotrs` (64-bit API) or `cusolverDn<T>potrs` (legacy API).
- Otherwise, use magma.
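
The heuristic above can be sketched as a small decision function. This is a hedged illustration, not the actual dispatch code: the function name, the `magma_available` flag, and the `has_64bit_api` flag are assumptions made for the example, and how the real code chooses between the 64-bit and legacy cusolver entry points is not described in this summary.

```python
def choose_cholesky_inverse_backend(batch_size, magma_available, has_64bit_api=True):
    """Return the name of the backend the heuristic would pick.

    Hypothetical helper for illustration only.
    """
    if not magma_available or batch_size == 1:
        # cusolver path: 64-bit API if available (assumed flag), else legacy.
        return "cusolverDnXpotrs" if has_64bit_api else "cusolverDn<T>potrs"
    # Batched input with magma installed.
    return "magma"


# Single matrix: cusolver is used even when magma is installed.
single = choose_cholesky_inverse_backend(batch_size=1, magma_available=True)
# Batched input with magma installed: magma is used.
batched = choose_cholesky_inverse_backend(batch_size=8, magma_available=True)
```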
See also https://github.com/pytorch/pytorch/issues/42666 and #47953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54676
Reviewed By: ngimel
Differential Revision: D27723805
Pulled By: mruberry
fbshipit-source-id: f65122812c9e56a781aabe4d87ed28b309abf93f