bf16xint16_gemm operator: add --transpose option (#2466)
Summary:
`--transpose` makes this benchmark test an int16 x bf16 matmul instead of a bf16 x int16 one.
This matters on H100 because the wgmma instruction can take register operands only on the LHS, so int16 x bf16 is probably the easier variant to support efficiently.
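For illustration, a minimal PyTorch sketch of the two orderings the flag switches between; the tensor names, shapes, and the explicit upcast are assumptions for the sketch, not the operator's actual code:

```python
# Hypothetical sketch (not the operator's code) of the two matmul orderings.
import torch

M, K, N = 4096, 4096, 4096
a_bf16 = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
b_int16 = torch.randint(-8, 8, (K, N), dtype=torch.int16, device="cuda")

# Default: bf16 x int16 -- the int16 operand sits on the RHS and is upcast first.
out = a_bf16 @ b_int16.to(torch.bfloat16)

# --transpose: int16 x bf16 -- the int16 operand moves to the LHS, the side that
# wgmma on H100 can feed from registers, so a fused upcast-then-mm kernel is
# likely easier to support efficiently for this ordering.
a_int16 = torch.randint(-8, 8, (M, K), dtype=torch.int16, device="cuda")
b_bf16 = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")
out_t = a_int16.to(torch.bfloat16) @ b_bf16
```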
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2466
Test Plan:
In OSS: ran `python run_benchmark.py triton --op bf16xint16_gemm --transpose`
Internally, ran `buck2 run mode/opt //pytorch/benchmark:triton -- --op bf16xint16_gemm --transpose`
Internally, we run into the issue fixed by https://github.com/triton-lang/triton/pull/4695, but otherwise both commands run.
Reviewed By: aakhundov
Differential Revision: D63294109
Pulled By: davidberard98
fbshipit-source-id: 3ea05bb09e62f51c405ae538726caf80e1ba0d63