Sparse CSR CUDA: add `triangular_solve_out` (#61858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61858
This PR adds `triangular_solve_out_sparse_csr_cuda`. The operation
computes the solution to a linear system whose coefficient
matrix is triangular.
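For readers unfamiliar with the operation, here is a minimal pure-Python sketch of forward substitution on a lower-triangular matrix stored in CSR form. It is only an illustration of the math, not the CUDA kernel added by this PR; the function name `csr_lower_triangular_solve` is hypothetical, and the argument names mirror the `crow_indices`/`col_indices`/`values` components of a CSR tensor.

```python
def csr_lower_triangular_solve(crow_indices, col_indices, values, b):
    """Solve A x = b for lower-triangular A given in CSR form.

    Hypothetical helper for illustration only (not part of PyTorch).
    Row i's nonzeros are values[crow_indices[i]:crow_indices[i + 1]],
    with matching column indices in col_indices.
    """
    n = len(crow_indices) - 1
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(crow_indices[i], crow_indices[i + 1]):
            j = col_indices[k]
            if j < i:
                # Subtract contributions of already-solved unknowns.
                acc -= values[k] * x[j]
            elif j == i:
                diag = values[k]
            else:
                raise ValueError("matrix is not lower triangular")
        if diag is None or diag == 0:
            raise ZeroDivisionError(f"zero or missing diagonal in row {i}")
        x[i] = acc / diag
    return x


# A = [[2, 0, 0],
#      [1, 3, 0],
#      [0, 4, 5]]
crow = [0, 1, 3, 5]
col = [0, 0, 1, 1, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = csr_lower_triangular_solve(crow, col, vals, [2.0, 7.0, 23.0])
print(x)  # [1.0, 2.0, 3.0]
```

The sketch handles a single right-hand-side vector and the lower-triangular case only; upper-triangular, transposed, and unit-diagonal variants are left out for brevity.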
Structured kernels are used, and the meta function needed some changes to
support the sparse CSR layout. With a sparse matrix input, the `cloned_coefficient`
tensor is a 0-sized tensor.
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D31948435
Pulled By: cpuhrsch
fbshipit-source-id: 7775fece83ca705a26d75f82aead10b956b14bfd