Add colfax_cutlass backend to flash_attention operator (#2296)
Summary:
Build the colfax_cutlass kernels with:
```
$ python install.py --userbenchmark triton --cutlass
```
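For context, a minimal sketch of how a new backend is typically registered on the TritonBench flash_attention operator is shown below, assuming a `register_benchmark`-style decorator; the import path, the `enabled=` keyword, and the `colfax_cutlass_fmha.fmha_forward` binding are placeholders for illustration, not the exact code in this PR.
```python
# Illustrative sketch only: import path, `enabled=` keyword, and the
# colfax_cutlass_fmha binding below are assumed placeholders, not this PR's diff.
import torch

from torchbenchmark.util.triton_op import BenchmarkOperator, register_benchmark

try:
    import colfax_cutlass_fmha  # hypothetical extension built by `install.py ... --cutlass`
    HAS_COLFAX_CUTLASS = True
except ImportError:
    HAS_COLFAX_CUTLASS = False


class Operator(BenchmarkOperator):
    @register_benchmark(enabled=HAS_COLFAX_CUTLASS)
    def colfax_cutlass(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # Return a no-arg callable so the harness can time only the kernel call.
        return lambda: colfax_cutlass_fmha.fmha_forward(q, k, v)
```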
Run the flash_attention operator with the sdpa, triton_tutorial_flash_v2, and colfax_cutlass backends on H100:
```
$ python run_benchmark.py triton --op flash_attention --only sdpa,triton_tutorial_flash_v2,colfax_cutlass --batch 128 --input-id 3 --num-inputs 5 --n-heads 8 --d-head 128 --metrics latency,tflops
SeqLen  sdpa-latency  sdpa-tflops  triton_tutorial_flash_v2-latency  triton_tutorial_flash_v2-tflops  colfax_cutlass-latency  colfax_cutlass-tflops
------  ------------  -----------  --------------------------------  -------------------------------  ----------------------  ---------------------
  1024       1.91248      287.457                           1.55574                          353.372                 1.38538                396.828
  2048       7.49987      293.208                           5.70656                           385.35                  5.4792                 401.34
  4096       29.4748      298.428                           21.7369                          404.662                 20.8335                 422.21
  8192       122.297      287.696                           85.1293                          413.305                 82.3884                427.055
 16384       462.649      304.199                           334.992                          420.122                 328.363                428.604
```
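The tflops column can be cross-checked from the latency column. Below is a short sketch assuming the standard non-causal forward-attention FLOP count of 4 * batch * heads * seqlen^2 * d_head (one QK^T matmul plus one P@V matmul, each 2 * seqlen^2 * d_head FLOPs); this is a common convention and not necessarily the exact formula TritonBench uses. Defaults match the command above (batch 128, 8 heads, d_head 128).
```python
# Sanity-check: derive tflops from latency for the configuration above.
# Assumes FLOPs = 4 * batch * heads * seqlen^2 * d_head (non-causal forward).
def attn_tflops(latency_ms, seqlen, batch=128, heads=8, d_head=128):
    flops = 4 * batch * heads * seqlen * seqlen * d_head
    return flops / (latency_ms * 1e-3) / 1e12

print(attn_tflops(1.91248, 1024))  # ~287.5, matches sdpa-tflops at SeqLen=1024
print(attn_tflops(1.38538, 1024))  # ~396.8, matches colfax_cutlass-tflops at SeqLen=1024
```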
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2296
Reviewed By: aaronenyeshi
Differential Revision: D58671502
Pulled By: xuzhao9
fbshipit-source-id: 38cba58463c6783c535eda3c11e5a75707ef9730