tritonbench bf16xint16 matmul template (#2348)
Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2348
Overall context: Before looking further into the bf16xint4 matmul, I'm planning to look into a bf16xint16 matmul first. The idea is that it will be the same as a bf16xbf16 matmul, except that the second operand needs to be cast from int16 to bf16 inside the triton kernel before the dot product runs.
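A minimal sketch of that idea (this is an assumption, not the PR's code: the kernel name, launcher, and block sizes are illustrative, and shapes are assumed to be multiples of the block sizes so masking can be omitted):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def bf16xint16_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)                  # bf16 tile, used as-is
        b = tl.load(b_ptrs).to(tl.bfloat16)  # int16 tile, cast in-kernel
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.bfloat16))

def bf16xint16(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: (M, K) bf16, b: (K, N) int16; dims assumed multiples of the block sizes
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.bfloat16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    bf16xint16_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```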
This PR is NOT fully functional yet; it's split up this way to make review easier.
Three kernels will be benchmarked here:
1. bf16xbf16 triton kernel - I've selected this as the "baseline" because, ideally, we'd like the bf16xint16 kernel to perform as close as possible to it.
2. bf16xint16 triton kernel - this is NOT implemented yet; it will be implemented in a follow-up PR.
3. bf16x(convert(int16 -> bf16)) triton kernel - i.e. convert int16 to bf16, write the result to global memory, and then run the bf16xbf16 kernel (sketched below).
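For reference, kernel 3's pattern boils down to a few lines (again an assumption, not the PR's code; `torch.matmul` stands in for whichever bf16xbf16 triton kernel is actually benchmarked):

```python
import torch

def convert_then_matmul(a: torch.Tensor, b_int16: torch.Tensor) -> torch.Tensor:
    # Separate pass: cast int16 -> bf16 and materialize the result in global
    # memory, paying an extra kernel launch and a full read/write of b ...
    b_bf16 = b_int16.to(torch.bfloat16)
    # ... then run the plain bf16xbf16 matmul on the converted operand.
    return torch.matmul(a, b_bf16)
```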
Differential Revision: D59234085
Test Plan: Imported from OSS
Reviewed By: xuzhao9
Pulled By: davidberard98
fbshipit-source-id: 75a493dbd78ee1aa1f63926f6dd61a2e7388816c