pytorch
4e042cfe - Improve triton bsr_dense_mm performance on column-major ordered inputs with float32 dtype (#108512)

Commit

1 year ago

Improve triton bsr_dense_mm performance on column-major ordered inputs with float32 dtype (#108512)

As in the title.

The performance of `bsr_dense_mm` on inputs using column-major storage order is relevant for the `linear(x, W)` operation, which for BSR weights is defined as `bsr_dense_mm(W, x.transpose(-2, -1)).transpose(-2, -1)`. The second argument to `bsr_dense_mm` is therefore a strided tensor using column-major storage order when `x` is C-contiguous.

For large inputs (size > 1000) and moderate sparsity in the BSR input, the speedup can be more than 3 times, as illustrated in the following figure (raw data: [bench_bsr_dense_mm_1_results.txt](https://github.com/pytorch/pytorch/files/12512245/bench_bsr_dense_mm_1_results.txt)):

![bench_bsr_dense_mm_1](https://github.com/pytorch/pytorch/assets/402156/c6372008-dfae-4d26-b119-2c3c944a74ae)

For small inputs (size=512), there is a slight degradation in performance. For row-major ordered inputs, there is no change in performance (see raw data above). For inputs with float16 dtype, there is no considerable change in performance (see blue marks in the figure).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108512
Approved by: https://github.com/cpuhrsch
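The storage-order argument in the message above can be checked with plain dense tensors. The following is a minimal sketch, not code from this commit: it assumes a dense `W` and uses the ordinary `@` matmul as a stand-in for the Triton `bsr_dense_mm` kernel (which needs a CUDA device and a BSR-format weight), and the shapes are arbitrary illustrative choices. It shows that transposing a C-contiguous `x` yields a column-major strided view, and that `linear(x, W)` matches the double-transpose form.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative shapes (assumptions, not taken from the benchmark).
x = torch.randn(4, 8)  # (batch, in_features), C-contiguous (row-major)
W = torch.randn(6, 8)  # (out_features, in_features), dense stand-in for a BSR weight

# Transposing returns a view over the same memory, so the result is
# laid out in column-major order relative to its new shape (8, 4).
xT = x.transpose(-2, -1)
print(x.stride())          # (8, 1): row-major
print(xT.stride())         # (1, 8): column-major
print(xT.is_contiguous())  # False

# linear(x, W) computes x @ W^T, which equals (W @ x^T)^T.
# Dense matmul is used here in place of bsr_dense_mm.
reference = F.linear(x, W)
via_transposes = (W @ xT).transpose(-2, -1)
print(torch.allclose(reference, via_transposes, atol=1e-5))
```

Because the second operand reaches the kernel as a non-contiguous view rather than a copy, its performance on column-major inputs directly affects `linear` with BSR weights.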

Author

pearu

Committer

pytorchmergebot

Parents

1dabfb68
