Add Triton Partition K GEMM to TritonBench
Summary: This is an early exploration. Triton Partition K ([link](https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md#parallelized-reductions)) uses two kernels (GEMM + reduce) to implement split-K. Compared with the `atomic_add` Triton GEMM, Partition K is friendlier to epilogue fusion.
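
To make the two-kernel structure concrete, here is a minimal pure-Python sketch of the idea. It is not the Triton kernel added in this diff. The function names (`partial_gemm`, `reduce_with_epilogue`, `partition_k_gemm`) and the list-of-lists workspace layout are hypothetical and chosen only for illustration. One plausible reading of the epilogue-fusion point is that the reduce step sees the fully accumulated value, so a non-linear epilogue can be applied there, whereas with atomic accumulation no single program holds the final sum.

```python
# Illustrative model of partition-K GEMM in plain Python.
# Assumption: names and workspace layout are hypothetical, not TritonBench code.

def partial_gemm(a, b, split_k):
    """Stage 1 (GEMM): each K partition writes its own partial M x N product."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + split_k - 1) // split_k  # ceil-divide K across partitions
    workspace = []
    for s in range(split_k):
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        workspace.append([
            [sum(a[i][p] * b[p][j] for p in range(lo, hi)) for j in range(n)]
            for i in range(m)
        ])
    return workspace  # shape: split_k x M x N


def reduce_with_epilogue(workspace, epilogue=lambda x: x):
    """Stage 2 (reduce): sum the partials, then apply the epilogue once."""
    m, n = len(workspace[0]), len(workspace[0][0])
    return [
        [epilogue(sum(part[i][j] for part in workspace)) for j in range(n)]
        for i in range(m)
    ]


def partition_k_gemm(a, b, split_k, epilogue=lambda x: x):
    """Run both stages back to back."""
    return reduce_with_epilogue(partial_gemm(a, b, split_k), epilogue)
```

For example, passing `epilogue=lambda x: max(x, 0)` applies a ReLU to the complete sum. Applying the same ReLU to each partial product before summing would generally give a different result, which is why the epilogue belongs in the reduce step.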
Reviewed By: bertmaher, chenyang78
Differential Revision: D59948589
fbshipit-source-id: a2118947f8e20ab17d26843fd263b83e22f58541