Add FP8 blockwise Triton kernel (#2304)
Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2304
Add the FP8 blockwise Triton kernel. The CUTLASS counterpart is not ready yet.
Reviewed By: xuzhao9
Differential Revision: D58615475
fbshipit-source-id: 1b555dc8b73c0a495f76dcde638f1cfca1b34ab8