Add ThunderKittens support (#2370)
Summary:
Register the ThunderKittens (TK) kernels as torch custom ops.
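The registration follows the standard torch custom-op pattern. A minimal sketch of that pattern is below; the `tk::attn_fwd` op name and the CPU stand-in implementation are illustrative only (the real PR binds a compiled CUDA kernel loaded from `tk_attn_h100_fwd.so`):

```python
import torch

# Declare a "tk" op namespace and an op schema. In the actual PR the
# implementation is a CUDA kernel registered from register_op.cu; here
# we use a plain CPU stand-in to show the registration mechanics.
lib = torch.library.Library("tk", "DEF")
lib.define("attn_fwd(Tensor q, Tensor k, Tensor v) -> Tensor")

def attn_fwd_impl(q, k, v):
    # CPU stand-in: vanilla scaled-dot-product attention.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    return scores.softmax(dim=-1) @ v

lib.impl("attn_fwd", attn_fwd_impl, "CPU")

# The op is now callable through the torch.ops namespace.
q = k = v = torch.randn(1, 2, 8, 16)
out = torch.ops.tk.attn_fwd(q, k, v)
print(out.shape)
```

Registering through `torch.library` is what lets the benchmark harness invoke the TK kernel interchangeably with the Triton and FlashAttention implementations.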
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2370
Test Plan:
To install ThunderKittens:
```
$ python install.py --userbenchmark triton --tk
nvcc -I/home/xz/miniconda3/lib/python3.11/site-packages/torch/include -I/home/xz/git/benchmark/submodules/ThunderKittens/examples/attn/h100 -I/home/xz/git/benchmark/submodules/ThunderKittens --use_fast_math --generate-line-info --restrict -std=c++20 --expt-relaxed-constexpr --expt-extended-lambda -forward-unknown-to-host-compiler -Xcompiler=-fno-strict-aliasing -D_GLIBCXX_USE_CXX11_ABI=0 -MD -MT -MF -x cu -lrt -lpthread -ldl -lcuda -lcudadevrt -lcudart_static -lcublas -DKITTENS_HOPPER -arch=sm_90a /home/xz/git/benchmark/userbenchmark/triton/tk/src/attn_h100_fwd/register_op.cu -o tk_attn_h100_fwd.so --shared -fPIC -L/home/xz/miniconda3/lib/python3.11/site-packages/torch/lib -ltorch -ltorch_cuda -lc10 -lc10_cuda
/home/xz/git/benchmark/userbenchmark/triton/tk/.data
```
To run the TK kernel on an H100:
```
python run_benchmark.py triton --op flash_attention --only triton_tutorial_flash_v2,tk,flash_v3 --metrics tflops
(Batch, Heads, SeqLen, Dhead)    triton_tutorial_flash_v2-tflops    tk-tflops    flash_v3-tflops
-----------------------------    -------------------------------    ---------    ---------------
(32, 32, 512, 64)                                        290.005      258.142            288.021
(16, 32, 1024, 64)                                       323.831      299.238            325.056
(8, 32, 2048, 64)                                        339.322      327.398            375.369
(4, 32, 4096, 64)                                        347.665      334.122            395.358
(2, 32, 8192, 64)                                        350.095      340.935            415.751
(1, 32, 16384, 64)                                       353.185      346.928            415.99
```
Reviewed By: manman-ren
Differential Revision: D60562354
Pulled By: xuzhao9
fbshipit-source-id: d7a3124f48b25fb7a8a3ebd9fa44fee709083f5a