Sparse CSR: Fix sampled_addmm for noncontiguous inputs and fix block sparse triangular solve
`torch.sparse.sampled_addmm` was incorrect for noncontiguous inputs on CUDA.
Unfortunately, the tests did not catch this: they used 1x5 and 5x1
shapes, and a tensor with a size-1 dimension stays contiguous even
after a transpose, so noncontiguous inputs were never actually tested.
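
Why 1x5 and 5x1 shapes cannot exercise the noncontiguous path can be
sketched in plain Python. The helpers below (`is_contiguous`,
`transpose`) are hypothetical illustrations of the row-major contiguity
rule, assumed to mirror what `Tensor.is_contiguous()` reports; they are
not the PyTorch implementation.

```python
def is_contiguous(shape, strides):
    """Row-major contiguity check that ignores size-1 dimensions."""
    expected = 1
    # Walk from the innermost dimension outwards.
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            # The stride of a size-1 dimension never affects addressing.
            continue
        if stride != expected:
            return False
        expected *= size
    return True


def transpose(shape, strides):
    """Swap the two dimensions of a 2-D layout without moving data."""
    return (shape[1], shape[0]), (strides[1], strides[0])


# 1x5 row-major layout has strides (5, 1); its transpose is still contiguous.
row_t = transpose((1, 5), (5, 1))
# 5x1 row-major layout has strides (1, 1); its transpose is still contiguous.
col_t = transpose((5, 1), (1, 1))
# 3x5 row-major layout has strides (5, 1); its transpose is NOT contiguous.
mat_t = transpose((3, 5), (5, 1))

print(is_contiguous(*row_t), is_contiguous(*col_t), is_contiguous(*mat_t))
```

Under this rule, only shapes with both dimensions larger than 1 (such
as 3x5) produce a genuinely noncontiguous input after a transpose.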
The block sparse triangular solver on CUDA could return incorrect
results when the sparse matrix had a zero on its diagonal. It now
returns nan in that case.
The tests also revealed that the unitriangular=True flag does not work
correctly on CPU in some cases. That part needs more investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76590
Approved by: https://github.com/cpuhrsch