Add colfax_cutlass backend to flash_attention operator (#2296)
Summary:
Build the colfax_cutlass kernels with:
```
$ python install.py --userbenchmark triton --cutlass
```
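For context, a minimal sketch of how a new backend is typically registered on the TritonBench flash_attention operator is shown below, assuming a `register_benchmark`-style decorator; the import path, the `enabled=` keyword, and the `colfax_cutlass_fmha.fmha_forward` binding are placeholders for illustration, not the exact code in this PR.
```python
# Illustrative sketch only: import path, `enabled=` keyword, and the
# colfax_cutlass_fmha binding below are assumed placeholders, not this PR's diff.
import torch

from torchbenchmark.util.triton_op import BenchmarkOperator, register_benchmark

try:
    import colfax_cutlass_fmha  # hypothetical extension built by `install.py ... --cutlass`
    HAS_COLFAX_CUTLASS = True
except ImportError:
    HAS_COLFAX_CUTLASS = False


class Operator(BenchmarkOperator):
    @register_benchmark(enabled=HAS_COLFAX_CUTLASS)
    def colfax_cutlass(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # Return a no-arg callable so the harness can time only the kernel call.
        return lambda: colfax_cutlass_fmha.fmha_forward(q, k, v)
```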
Run the flash_attention operator with the sdpa, triton_tutorial_flash_v2, and colfax_cutlass backends on H100:
```
$ python run_benchmark.py triton --op flash_attention --only sdpa,triton_tutorial_flash_v2,colfax_cutlass --batch 128 --input-id 3 --num-inputs 5 --n-heads 8 --d-head 128 --metrics latency,tflops
SeqLen  sdpa-latency  sdpa-tflops  triton_tutorial_flash_v2-latency  triton_tutorial_flash_v2-tflops  colfax_cutlass-latency  colfax_cutlass-tflops
------  ------------  -----------  --------------------------------  -------------------------------  ----------------------  ---------------------
  1024       1.91248      287.457                           1.55574                          353.372                 1.38538                396.828
  2048       7.49987      293.208                           5.70656                           385.35                  5.4792                 401.34
  4096       29.4748      298.428                           21.7369                          404.662                 20.8335                 422.21
  8192       122.297      287.696                           85.1293                          413.305                 82.3884                427.055
 16384       462.649      304.199                           334.992                          420.122                 328.363                428.604
```
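The tflops column can be cross-checked from the latency column. Below is a short sketch assuming the standard non-causal forward-attention FLOP count of 4 * batch * heads * seqlen^2 * d_head (one QK^T matmul plus one P@V matmul, each 2 * seqlen^2 * d_head FLOPs); this is a common convention and not necessarily the exact formula TritonBench uses. Defaults match the command above (batch 128, 8 heads, d_head 128).
```python
# Sanity-check: derive tflops from latency for the configuration above.
# Assumes FLOPs = 4 * batch * heads * seqlen^2 * d_head (non-causal forward).
def attn_tflops(latency_ms, seqlen, batch=128, heads=8, d_head=128):
    flops = 4 * batch * heads * seqlen * seqlen * d_head
    return flops / (latency_ms * 1e-3) / 1e12

print(attn_tflops(1.91248, 1024))  # ~287.5, matches sdpa-tflops at SeqLen=1024
print(attn_tflops(1.38538, 1024))  # ~396.8, matches colfax_cutlass-tflops at SeqLen=1024
```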
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2296
Reviewed By: aaronenyeshi
Differential Revision: D58671502
Pulled By: xuzhao9
fbshipit-source-id: 38cba58463c6783c535eda3c11e5a75707ef9730