Added cuBLAS path for torch.triangular_solve (#53147)
Summary:
This PR adds the cuBLAS based path for `torch.triangular_solve`
The device dispatching helper function was removed from native_functions.yml, it is replaced with DECLARE/DEFINE_DISPATCH.
`magmaTriangularSolve` is removed and replaced with cuBLAS calls, this is not a BC-breaking change because internally MAGMA just calls the same cuBLAS function and doesn't do anything else.
Batched cuBLAS is faster than batched MAGMA for matrices of size up until 512x512, after that MAGMA is faster. For batches smaller than ~8 and matrix sizes larger than 64x64 a forloop of cuBLAS calls is faster than batched version.
Ref. https://github.com/pytorch/pytorch/issues/47953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53147
Reviewed By: heitorschueroff
Differential Revision: D27007416
Pulled By: mruberry
fbshipit-source-id: ddfc190346e6a56b84145ed0a9af67ca9cde3506