Add HSTU ragged attention operator (#2453)
Summary:
Adds the HSTU ragged attention operator benchmark, covering both the standard and the persistent Triton kernel variants (`hstu_triton_ragged_attention` and `hstu_triton_ragged_attention_persistent`).
On H100:
```
$ python run_benchmark.py triton --op ragged_attention
x_val hstu_triton_ragged_attention-latency hstu_triton_ragged_attention_persistent-latency
----------------- -------------------------------------- -------------------------------------------------
(8, 4, 512, 2048) 0.0141706 0.0128713
(8, 4, 512, 2048) 0.0187315 0.0171204
(8, 4, 512, 2048) 0.0156807 0.0155399
(8, 4, 512, 2048) 0.0165724 0.0154679
(8, 4, 512, 2048) 0.0163886 0.0157738
(8, 4, 512, 2048) 0.0173378 0.0155991
(8, 4, 512, 2048) 0.0164874 0.0153128
(8, 4, 512, 2048) 0.0203275 0.0172193
(8, 4, 512, 2048) 0.0214526 0.0185414
(8, 4, 512, 2048) 0.0172307 0.0169625
```
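For context, ragged (jagged) attention runs over a batch of variable-length sequences packed into a single token-major tensor plus a prefix-sum offsets tensor, instead of padded `[B, T, ...]` tensors. The sketch below is only an illustration of that input layout, not the operator's actual API: the names `ragged_attention_ref` and `seq_offsets` are assumptions, and plain softmax attention stands in for the HSTU formulation.

```
# Minimal sketch of the ragged (jagged) input layout, assuming packed q/k/v of
# shape [total_tokens, num_heads, head_dim] and a [batch + 1] offsets tensor.
# Illustrative only; not the benchmark's kernel or its API.
import torch


def ragged_attention_ref(q, k, v, seq_offsets):
    out = torch.zeros_like(v)
    scale = q.shape[-1] ** -0.5
    for b in range(seq_offsets.numel() - 1):
        s, e = seq_offsets[b].item(), seq_offsets[b + 1].item()
        qb, kb, vb = q[s:e], k[s:e], v[s:e]                  # [len_b, H, D]
        scores = torch.einsum("qhd,khd->hqk", qb, kb) * scale  # [H, len_b, len_b]
        probs = torch.softmax(scores, dim=-1)
        out[s:e] = torch.einsum("hqk,khd->qhd", probs, vb)   # back to [len_b, H, D]
    return out


# Example: 3 sequences of lengths 5, 2, 7; 4 heads; head_dim 64.
lengths = torch.tensor([5, 2, 7])
seq_offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
total = int(lengths.sum())
q, k, v = (torch.randn(total, 4, 64) for _ in range(3))
out = ragged_attention_ref(q, k, v, seq_offsets)
print(out.shape)  # torch.Size([14, 4, 64])
```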
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2453
Reviewed By: manman-ren
Differential Revision: D62513596
Pulled By: xuzhao9
fbshipit-source-id: 154ef0145ca94ecfeb0b075c9dec01b395683ef2