Improve triton bsr_dense_mm performance on column-major ordered inputs with float32 dtype (#108512)
As in the title.
The `bsr_dense_mm` performance on inputs using column-major storage order is relevant for the `linear(x, W)` operation, which for BSR weights is defined as `bsr_dense_mm(W, x.transpose(-2, -1)).transpose(-2, -1)`. The second argument to `bsr_dense_mm` is therefore a strided tensor using column-major storage order when `x` is C-contiguous.
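To make the storage-order point concrete, here is a minimal sketch that uses dense tensors only. It assumes `torch.mm` as a stand-in for `bsr_dense_mm` and a dense `W` in place of a BSR weight; the shapes are arbitrary and chosen for illustration. It checks that transposing a C-contiguous `x` yields a non-contiguous (column-major) view and that the transposed product matches `linear(x, W)`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Assumption: small illustrative shapes; W is dense here, not BSR.
x = torch.randn(4, 8)  # (batch, in_features), C-contiguous (row-major)
W = torch.randn(6, 8)  # (out_features, in_features)

# Transposing swaps the strides without copying, so the resulting
# (8, 4) view is laid out in column-major order.
xt = x.transpose(-2, -1)

x_is_row_major = x.is_contiguous()
xt_is_row_major = xt.is_contiguous()
xt_strides = xt.stride()

# linear(x, W) computes x @ W^T, which equals (W @ x^T)^T.
# torch.mm stands in for bsr_dense_mm in this dense sketch.
reference = F.linear(x, W)
via_mm = torch.mm(W, xt).transpose(-2, -1)
matches = torch.allclose(reference, via_mm, atol=1e-5)
```

Because `xt` is a view rather than a copy, the matrix multiplication kernel receives the column-major layout directly, which is the case this change targets.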
For large inputs (size > 1000) and moderate sparsity in the BSR input, the speedup can exceed 3x, as illustrated in the following figure (raw data: [bench_bsr_dense_mm_1_results.txt](https://github.com/pytorch/pytorch/files/12512245/bench_bsr_dense_mm_1_results.txt)):
![bench_bsr_dense_mm_1](https://github.com/pytorch/pytorch/assets/402156/c6372008-dfae-4d26-b119-2c3c944a74ae)
For small inputs (size=512), there is a slight performance degradation.
For row-major ordered inputs, there is no change in performance (see raw data above).
For inputs with float16 dtype, there is no significant change in performance (see blue marks in the figure).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108512
Approved by: https://github.com/cpuhrsch