gemm: fix triton.ops and be consistent with m/n/k ordering (#2350)
Summary:
triton.ops is no more; the implementation now lives in triton-lang/kernels, which I've copy-pasted here to avoid taking a dependency.
Also, we were not consistent in how we read m/n/k from datasets. I've standardized on the m/n/k ordering, since NVIDIA tends to use that ordering in its descriptions of tensor core ops.
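For reference, a minimal sketch of what the m/n/k convention means for GEMM shapes (illustrative only, not code from this PR; the variable names are hypothetical): C is m-by-n, A is m-by-k, B is k-by-n.

```python
import numpy as np

# Hypothetical sizes, as they would be read from a dataset row in m/n/k order.
m, n, k = 4, 3, 5

# In the m/n/k convention: C[m, n] = A[m, k] @ B[k, n].
A = np.zeros((m, k))
B = np.zeros((k, n))
C = A @ B

assert C.shape == (m, n)
```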
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2350
Test Plan:
```
python run_benchmark.py triton --op gemm
```
Reviewed By: int3
Differential Revision: D59242936
Pulled By: bertmaher
fbshipit-source-id: dde6fe8da90f01d642a10dd6a8efe4aa9396f1ba