Extend support to varying block sizes on both dimensions for 2D matrices (#2302)
Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2302
Extend support for reductions across individual dimensions of 2-dimensional matrices by allowing varying block sizes on both the `M` (first) and `N` (second) dimensions.
The existing kernel performed a simplified reduction, assuming that the entire reduction dimension fit within one thread block. The new kernel implementation removes this assumption, allowing both the reduction and the non-reduction dimensions to span multiple thread blocks. This implementation also enables autotuning on block sizes for both the `M` and `N` dimensions.
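The blocked reduction can be sketched in host-side NumPy (a hypothetical illustration, not the actual Triton kernel): the grid tiles the non-reduction dimension, and each program loops over tiles of the reduction dimension, so neither dimension has to fit in a single block.

```python
import numpy as np

def blocked_dim1_sum(x, block_m=4, block_n=8):
    """Sum a 2D matrix across dim 1 (`N`) using blocks on both dimensions.

    Mirrors the blocked-kernel structure: each (block_m x block_n) tile is
    loaded separately, so neither `M` nor `N` must fit in one thread block.
    """
    m, n = x.shape
    out = np.zeros(m, dtype=x.dtype)
    for m0 in range(0, m, block_m):          # grid over the non-reduction dim
        rows = min(block_m, m - m0)          # handle a ragged final block
        acc = np.zeros(rows, dtype=x.dtype)  # per-program accumulator
        for n0 in range(0, n, block_n):      # loop over the reduction dim
            tile = x[m0:m0 + block_m, n0:n0 + block_n]
            acc += tile.sum(axis=1)          # reduce within the tile
        out[m0:m0 + block_m] = acc
    return out
```

Because `block_m` and `block_n` are independent parameters, an autotuner can search over both, which is the point of the change described above.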
For 1D results, add a `sum_then_buffer` configuration flag that selects which kernel strategy to run. `sum_then_buffer` reduces each block of input to a partial sum and accumulates those sums into a buffer. `buffer_then_sum` accumulates blocks of raw input elementwise into a buffer, then reduces the buffer.
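The two strategies can be sketched in NumPy as follows (a hypothetical illustration of the accumulation orders, not the kernel code itself); both produce the same full reduction but differ in when the within-block reduction happens.

```python
import numpy as np

def sum_then_buffer(x, block=8):
    # Reduce each input block to a scalar first, then accumulate
    # the partial sums into a (scalar) buffer.
    buf = x.dtype.type(0)
    for i in range(0, x.size, block):
        buf += x[i:i + block].sum()
    return buf

def buffer_then_sum(x, block=8):
    # Accumulate raw input blocks elementwise into a block-sized
    # buffer, then reduce the buffer once at the end.
    buf = np.zeros(block, dtype=x.dtype)
    for i in range(0, x.size, block):
        chunk = x[i:i + block]
        buf[:chunk.size] += chunk  # ragged final block touches a prefix
    return buf.sum()
```

Which order is faster depends on the hardware and block size, which is why it is exposed as a tunable configuration rather than fixed.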
Reviewed By: davidberard98
Differential Revision: D58313958
fbshipit-source-id: 639ea6b7d7b92f478c0f5669a1cdc0dcb68004e3