(bsr/csr) x dense mm (#85551)
As per title. This implementation is not optimal and could be improved, though doing so would require native kernels (i.e. the block matching need not be materialized).
Compared to existing kernels it offers:
- Half-precision float support (in fact, any dtype that supports `matmul` will work).
- Arbitrary block sizes.
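
For readers unfamiliar with the layout, here is a minimal pure-Python sketch of what a BSR x dense product computes for an arbitrary (including non-square) block size. It is an illustration only, not the code in this PR: the function name `bsr_dense_mm` and its argument names are hypothetical, and matrices are plain nested lists rather than tensors. CSR corresponds to the special case `block_shape == (1, 1)`.

```python
def bsr_dense_mm(crow_indices, col_indices, values, block_shape, dense):
    """Multiply a BSR matrix by a dense matrix (hypothetical helper).

    crow_indices: block-row pointers, length n_block_rows + 1
    col_indices:  block-column index of each stored block
    values:       list of stored blocks, each a bh x bw nested list
    block_shape:  (bh, bw), may be non-square
    dense:        dense right-hand side as a nested list
    """
    bh, bw = block_shape
    n_block_rows = len(crow_indices) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_block_rows * bh)]

    for block_row in range(n_block_rows):
        # Stored blocks of this block row live in [start, end).
        start, end = crow_indices[block_row], crow_indices[block_row + 1]
        for k in range(start, end):
            block_col = col_indices[k]
            block = values[k]
            # Accumulate block @ (matching bw rows of dense) into the output.
            for i in range(bh):
                out_row = out[block_row * bh + i]
                for j in range(bw):
                    a = block[i][j]
                    dense_row = dense[block_col * bw + j]
                    for c in range(n_cols):
                        out_row[c] += a * dense_row[c]
    return out


# A 4x6 sparse matrix stored as two 2x3 blocks, times a 6x2 dense matrix.
crow_indices = [0, 1, 2]
col_indices = [1, 0]
values = [
    [[1, 2, 3], [4, 5, 6]],     # block row 0, block column 1
    [[7, 8, 9], [10, 11, 12]],  # block row 1, block column 0
]
dense = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [1, 1]]
result = bsr_dense_mm(crow_indices, col_indices, values, (2, 3), dense)
```

The sketch pairs each stored block with the slice of dense rows it multiplies, one block at a time, so nothing beyond the output is materialized.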
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85551
Approved by: https://github.com/amjames, https://github.com/cpuhrsch