Sparse CSR CUDA: add `addmv_out` (#61407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61407
This PR adds `addmv_out_sparse_csr_cuda`. The operation computes a
matrix-vector product for a sparse CSR matrix on CUDA. Because
`structured_delegate` is used, only the out variant needs to be
implemented; the in-place and functional variants are autogenerated.
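For reference, the semantics of `addmv` on a CSR matrix can be sketched in plain Python. This is only an illustration of the math (`out = beta * input + alpha * (mat @ vec)`), not the cuSPARSE-backed code added in this PR; the helper name `csr_addmv` and the list-based CSR layout are assumptions made for the example.

```python
def csr_addmv(inp, crow_indices, col_indices, values, vec, beta=1.0, alpha=1.0):
    """Hypothetical reference: out = beta * inp + alpha * (A @ vec), A given in CSR form."""
    out = []
    for row in range(len(crow_indices) - 1):
        acc = 0.0
        # Nonzeros of this row live in values[crow_indices[row]:crow_indices[row + 1]].
        for k in range(crow_indices[row], crow_indices[row + 1]):
            acc += values[k] * vec[col_indices[k]]
        out.append(beta * inp[row] + alpha * acc)
    return out


# A = [[1, 0, 2],
#      [0, 3, 0]]
crow_indices = [0, 2, 3]
col_indices = [0, 2, 1]
values = [1.0, 2.0, 3.0]

result = csr_addmv([10.0, 20.0], crow_indices, col_indices, values,
                   [1.0, 1.0, 1.0], beta=0.5, alpha=2.0)
print(result)
```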
Working on this PR revealed that float16 (and probably bfloat16) inputs
do not work correctly in cuSPARSE. For this case, `addmm` is used
instead, with unsqueezes and squeezes around the call.
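The idea behind the fallback can be sketched in plain Python: treat the vectors as single-column matrices, run a matrix-matrix `addmm`, and drop the extra dimension from the result. This is a rough illustration under assumptions, not the PR's C++ code; the helpers `unsqueeze_last`, `squeeze_last`, `csr_addmm`, and `addmv_via_addmm` are hypothetical names.

```python
def unsqueeze_last(v):
    """Turn a length-n vector into an n x 1 matrix."""
    return [[x] for x in v]


def squeeze_last(m):
    """Turn an n x 1 matrix back into a length-n vector."""
    return [row[0] for row in m]


def csr_addmm(inp, crow_indices, col_indices, values, dense, beta=1.0, alpha=1.0):
    """Hypothetical reference: out = beta * inp + alpha * (A @ dense), A given in CSR form."""
    n_cols = len(dense[0])
    out = []
    for row in range(len(crow_indices) - 1):
        acc = [0.0] * n_cols
        for k in range(crow_indices[row], crow_indices[row + 1]):
            for j in range(n_cols):
                acc[j] += values[k] * dense[col_indices[k]][j]
        out.append([beta * inp[row][j] + alpha * acc[j] for j in range(n_cols)])
    return out


def addmv_via_addmm(inp, crow_indices, col_indices, values, vec, beta=1.0, alpha=1.0):
    """addmv expressed as addmm on single-column matrices."""
    out = csr_addmm(unsqueeze_last(inp), crow_indices, col_indices, values,
                    unsqueeze_last(vec), beta, alpha)
    return squeeze_last(out)


# A = [[1, 0, 2],
#      [0, 3, 0]]
result = addmv_via_addmm([10.0, 20.0], [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0],
                         [1.0, 1.0, 1.0], beta=0.5, alpha=2.0)
print(result)
```

The result matches what a direct matrix-vector routine would return, which is why the fallback can stand in for the dtypes that the direct path does not handle.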
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D31584499
Pulled By: ngimel
fbshipit-source-id: 4c507791471ada88969116b88eeaaba7a7536431