Fix colfax_cutlass flash_attention operator (#2401)
Summary:
The colfax_cutlass kernels fail to link because their C++ templates are never instantiated: the template definitions are not visible where the kernels are used. We need to explicitly include the header file that carries the definitions so that all template parameterizations get instantiated.
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2401
Test Plan:
Install the colfax_cutlass operators:
```
python install.py --userbenchmark triton --cutlass
/home/xz/git/benchmark/submodules/cutlass-kernels/src/fmha/fmha_forward.cu(826): warning #117-D: non-void function "main" should return a value
return;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/home/xz/git/benchmark/submodules/cutlass-kernels/src/fmha/fmha_forward.cu(826): warning #117-D: non-void function "main" should return a value
return;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```
Run the flash_attention operator from colfax_cutlass:
```
python run_benchmark.py triton --op flash_attention --only colfax_cutlass --num-inputs 1
(Batch, Heads, SeqLen, Dhead) colfax_cutlass-latency
------------------------------- ------------------------
(32, 32, 512, 64) 0.001024
```
Reviewed By: manman-ren
Differential Revision: D60557212
Pulled By: xuzhao9
fbshipit-source-id: 25b216f850d2e82815041059d372627806bfd3ca