Add TMA+persistent bf16 gemm (#2373)
Summary:
Add persistent and TMA-persistent matmul variants from the persistent matmul tutorial: https://github.com/triton-lang/triton/blob/main/python/tutorials/09-persistent-matmul.py. Note that these aren't autotuned, so we might get bad results for small or odd shapes. Also add a "TMA cached" variant, which caches the TMA descriptors during benchmarking so that the host-to-device (HtoD) overhead of setting up the TMA descriptors isn't included in the measurement.
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2373
Reviewed By: manman-ren, chenyang78
Differential Revision: D59648079
Pulled By: bertmaher
fbshipit-source-id: 4d1dca591bdde7709659339acfb0f6a952c5c02d