[inductor] add one triton config for reduction (#106925)
This config, found by coordinate descent tuning, improves the kernel https://gist.github.com/shunting314/189a8ef69f90db9d614a823385147a72 from
- 10.008ms 5.993GB 598.83GB/s
to
- 6.170ms 5.993GB 971.28GB/s.
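As a quick sanity check on the numbers above, the sketch below recomputes the bandwidth and speedup. It assumes the GB/s figure is simply memory traffic divided by kernel time; the helper name `bandwidth_gb_per_s` is made up for illustration. Small differences from the reported values are expected because the inputs are rounded.

```python
def bandwidth_gb_per_s(gigabytes: float, milliseconds: float) -> float:
    """Effective bandwidth, assuming bandwidth = traffic / elapsed time."""
    return gigabytes / (milliseconds / 1000.0)


# Figures copied from the benchmark lines above.
traffic_gb = 5.993
before_ms = 10.008
after_ms = 6.170

before_bw = bandwidth_gb_per_s(traffic_gb, before_ms)
after_bw = bandwidth_gb_per_s(traffic_gb, after_ms)
speedup = before_ms / after_ms

print(f"before: {before_bw:.2f} GB/s")
print(f"after:  {after_bw:.2f} GB/s")
print(f"speedup: {speedup:.3f}x")
```

With these inputs the speedup works out to about 1.62x.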
It should only affect reductions with hint ReductionHint.DEFAULT, or cases where max autotune is enabled.
(Funnily enough, before I upgraded my triton version, the improvement was from 9.076ms -> 5.692ms.)
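For readers unfamiliar with coordinate descent tuning, here is a minimal sketch of the general idea, not Inductor's actual tuner: change one parameter at a time (halve or double it), keep the change if the cost drops, and stop when no single-parameter move helps. The function names, the `XBLOCK`/`RBLOCK` starting values, and the `toy_cost` function are all hypothetical; in particular, the optimum of `toy_cost` is chosen arbitrarily and is not the config this change adds. A real tuner would benchmark the kernel instead of calling a formula.

```python
import math


def coordinate_descent(cost, start, max_value=2048):
    """Greedy search: tweak one parameter at a time, keep strict improvements."""
    best = dict(start)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for name in list(best):
            # Try halving and doubling the current value of this parameter.
            for value in (best[name] // 2, best[name] * 2):
                if value < 1 or value > max_value:
                    continue
                candidate = dict(best)
                candidate[name] = value
                candidate_cost = cost(candidate)
                if candidate_cost < best_cost:
                    best, best_cost = candidate, candidate_cost
                    improved = True
    return best, best_cost


def toy_cost(cfg):
    """Hypothetical stand-in for a measured kernel time (lower is better)."""
    return (math.log2(cfg["XBLOCK"]) - 6) ** 2 + (math.log2(cfg["RBLOCK"]) - 6) ** 2


tuned, tuned_cost = coordinate_descent(toy_cost, {"XBLOCK": 128, "RBLOCK": 8})
print(tuned, tuned_cost)
```

Because only strict improvements are accepted, the search terminates at a point where no single halving or doubling lowers the cost, which is a local optimum rather than a guaranteed global one.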
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106925
Approved by: https://github.com/jansel