Add ThunderKittens support (#2370)
Summary:
Register the ThunderKittens (TK) kernels as torch custom ops.
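The registration follows the standard torch custom-op pattern. A minimal sketch of that pattern is below; the `tk::attn_fwd` op name and the CPU stand-in implementation are illustrative only (the real PR binds a compiled CUDA kernel loaded from `tk_attn_h100_fwd.so`):

```python
import torch

# Declare a "tk" op namespace and an op schema. In the actual PR the
# implementation is a CUDA kernel registered from register_op.cu; here
# we use a plain CPU stand-in to show the registration mechanics.
lib = torch.library.Library("tk", "DEF")
lib.define("attn_fwd(Tensor q, Tensor k, Tensor v) -> Tensor")

def attn_fwd_impl(q, k, v):
    # CPU stand-in: vanilla scaled-dot-product attention.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    return scores.softmax(dim=-1) @ v

lib.impl("attn_fwd", attn_fwd_impl, "CPU")

# The op is now callable through the torch.ops namespace.
q = k = v = torch.randn(1, 2, 8, 16)
out = torch.ops.tk.attn_fwd(q, k, v)
print(out.shape)
```

Registering through `torch.library` is what lets the benchmark harness invoke the TK kernel interchangeably with the Triton and FlashAttention implementations.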
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2370
Test Plan:
To install ThunderKittens:
```
$ python install.py --userbenchmark triton --tk
nvcc -I/home/xz/miniconda3/lib/python3.11/site-packages/torch/include -I/home/xz/git/benchmark/submodules/ThunderKittens/examples/attn/h100 -I/home/xz/git/benchmark/submodules/ThunderKittens --use_fast_math --generate-line-info --restrict -std=c++20 --expt-relaxed-constexpr --expt-extended-lambda -forward-unknown-to-host-compiler -Xcompiler=-fno-strict-aliasing -D_GLIBCXX_USE_CXX11_ABI=0 -MD -MT -MF -x cu -lrt -lpthread -ldl -lcuda -lcudadevrt -lcudart_static -lcublas -DKITTENS_HOPPER -arch=sm_90a /home/xz/git/benchmark/userbenchmark/triton/tk/src/attn_h100_fwd/register_op.cu -o tk_attn_h100_fwd.so --shared -fPIC -L/home/xz/miniconda3/lib/python3.11/site-packages/torch/lib -ltorch -ltorch_cuda -lc10 -lc10_cuda
/home/xz/git/benchmark/userbenchmark/triton/tk/.data
```
To run the TK kernel on an H100:
```
python run_benchmark.py triton --op flash_attention --only triton_tutorial_flash_v2,tk,flash_v3 --metrics tflops
(Batch, Heads, SeqLen, Dhead)    triton_tutorial_flash_v2-tflops    tk-tflops    flash_v3-tflops
-----------------------------    -------------------------------    ---------    ---------------
(32, 32, 512, 64)                                        290.005      258.142            288.021
(16, 32, 1024, 64)                                       323.831      299.238            325.056
(8, 32, 2048, 64)                                        339.322      327.398            375.369
(4, 32, 4096, 64)                                        347.665      334.122            395.358
(2, 32, 8192, 64)                                        350.095      340.935            415.751
(1, 32, 16384, 64)                                       353.185      346.928            415.99
```
Reviewed By: manman-ren
Differential Revision: D60562354
Pulled By: xuzhao9
fbshipit-source-id: d7a3124f48b25fb7a8a3ebd9fa44fee709083f5a