Add support for reducing across individual dimensions for 2D matrices using the sum Triton kernel (#2295)
Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2295
Support reducing a 2-dimensional matrix across one dimension, where the `BLOCK_SIZE` in the reduced dimension is at least as large as the dimension size. This kernel performs a simplified reduction which assumes that the entire reduction dimension of the tensor fits in a single thread block. The implementation swaps which of the `M` and `N` block sizes is fixed and which is autotuned, depending on the reduction dimension. For example, this kernel will reduce across the 0th dimension of an (M, N) = (16, 16) matrix where `BLOCK_SIZE_M >= 16` and `BLOCK_SIZE_N` is autotuned.
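As a rough illustration of the scheme described above, here is a minimal pure-Python sketch that emulates the blocked reduction on the CPU. It is not the Triton kernel from this PR: the names `blocked_sum`, `block_size_reduced`, and `block_size_kept` are hypothetical, and the loops only mimic how each program would cover the whole reduced dimension in one masked block while tiling the other dimension.

```python
def cdiv(a, b):
    """Ceiling division, used to count how many blocks cover a dimension."""
    return (a + b - 1) // b


def blocked_sum(x, dim, block_size_reduced, block_size_kept):
    """Sum a 2D list of lists along `dim` (0 or 1), emulating one program per block.

    Assumes, like the simplified kernel, that the whole reduced dimension
    fits in one block: block_size_reduced >= size of the reduced dimension.
    """
    M, N = len(x), len(x[0])
    reduce_len = M if dim == 0 else N
    keep_len = N if dim == 0 else M
    if block_size_reduced < reduce_len:
        raise ValueError("reduced dimension must fit in a single block")

    out = [0] * keep_len
    # Each iteration of this loop stands in for one program (thread block).
    for pid in range(cdiv(keep_len, block_size_kept)):
        start = pid * block_size_kept
        for k in range(start, start + block_size_kept):
            if k >= keep_len:
                continue  # mask: offset is past the end of the kept dimension
            acc = 0
            for r in range(block_size_reduced):
                if r < reduce_len:  # mask: padding inside the oversized block
                    acc += x[r][k] if dim == 0 else x[k][r]
            out[k] = acc
    return out


# (M, N) = (16, 16), reducing across dim 0 with BLOCK_SIZE_M = 16.
x = [[i * 16 + j for j in range(16)] for i in range(16)]
column_sums = blocked_sum(x, dim=0, block_size_reduced=16, block_size_kept=4)
```

Here `block_size_kept=4` stands in for whatever value the autotuner would pick for the non-reduction dimension; it changes only how the work is split across programs, not the result.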
Add a `best_config` metric that reports the best `BLOCK_SIZE` for the non-reduction dimension and the best `num_warps` for a given input size.
Reviewed By: jbschlosser
Differential Revision: D58261858
fbshipit-source-id: 8995c91c54e9792b52f4608446e8e940027a604d