Update GQA benchmark to support bfloat16 (#26898)
Update the GQA benchmark to support bfloat16 and default to testing only the
first configuration (fast mode).
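For reference, a minimal sketch of what dtype coverage plus a fast-mode default can look like; the config class, field names, and `build_configs` helper below are illustrative placeholders, not the benchmark script's actual API:

```python
# Illustrative sketch only -- the config class and helper are hypothetical,
# not the actual GQA benchmark script's API.
from dataclasses import dataclass
from itertools import product


@dataclass
class GqaConfig:
    model: str
    batch: int
    num_heads: int
    kv_num_heads: int
    head_size: int
    dtype: str  # "float16" or "bfloat16"


def build_configs(fast_mode: bool = True) -> list[GqaConfig]:
    models = [("Llama3-8B", 32, 8, 128)]
    dtypes = ["float16", "bfloat16"]  # bfloat16 is the newly added coverage
    configs = [
        GqaConfig(name, 1, heads, kv_heads, head_size, dtype)
        for (name, heads, kv_heads, head_size), dtype in product(models, dtypes)
    ]
    # Fast mode: benchmark only the first configuration by default.
    return configs[:1] if fast_mode else configs


if __name__ == "__main__":
    for cfg in build_configs(fast_mode=False):
        print(cfg)
```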
Note that test_sparse_attention.py was removed in
https://github.com/microsoft/onnxruntime/pull/23547. Since the benchmark
script still references it, this change adds it back and disables the test in
pipeline mode.
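A minimal sketch of how a test file can be re-added while its test is gated off in CI; the environment-variable name and test body here are hypothetical, not necessarily what the pipeline uses:

```python
# Sketch of gating a test off in pipeline (CI) runs -- the environment
# variable name below is hypothetical.
import os
import unittest

pipeline_mode = os.environ.get("ORT_TEST_PIPELINE_MODE", "0") == "1"


class TestSparseAttention(unittest.TestCase):
    @unittest.skipIf(pipeline_mode, "Disabled in pipeline mode; kept for local benchmarking.")
    def test_sparse_attention(self):
        # The real test would build and run a SparseAttention model here.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```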
Example output from an H200 GPU:
```
prompt-sm90-Llama3-8B-b1-h32_8x128-float16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.781751                 0.571226
1             32.0       0.893813                 0.684198
2             64.0       1.434056                 1.589263
3            128.0       1.142192                 1.681969
4            256.0       1.503483                 2.225498
5            512.0       1.045732                 1.878660
6           1024.0       2.334924                 0.916745
7           2048.0       2.229924                 3.001290
8           4096.0       4.309678                 3.198855
9           8192.0       7.932211                 7.910411
token-sm90-Llama3-8B-b1-h32_8_d128-float16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       1.751966                 0.780081
1                  32.0       1.302806                 0.043939
2                  64.0       2.301024                 2.207282
3                 128.0       2.294556                 3.010107
4                 256.0       2.931330                 1.781768
5                 512.0       1.210220                 2.799579
6                1024.0       2.767142                 2.660434
7                2048.0       1.420229                 0.091433
8                4096.0       0.860655                 0.801022
9                8191.0       0.749525                 0.820858
prompt-sm90-Llama3-8B-b1-h32_8x128-bfloat16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       1.085427                 0.666664
1             32.0       1.714795                 0.931262
2             64.0       1.729093                 1.438733
3            128.0       1.071263                 2.486135
4            256.0       1.957349                 1.342417
5            512.0       1.159680                 1.591321
6           1024.0       0.743702                 2.035150
7           2048.0       1.452736                 1.788801
8           4096.0       4.029917                 4.041565
9           8192.0       7.934485                 7.931600
token-sm90-Llama3-8B-b1-h32_8_d128-bfloat16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.044354                 0.043983
1                  32.0       0.040715                 0.044061
2                  64.0       0.045586                 0.044071
3                 128.0       0.062204                 0.061418
4                 256.0       0.074764                 4.874854
5                 512.0       2.472094                 2.102259
6                1024.0       4.911269                 1.396149
7                2048.0       4.898032                 1.684034
8                4096.0       2.523432                 2.192279
9                8191.0       1.651366                 3.427370
```
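The layout of the tables resembles a default pandas DataFrame dump, which makes the results easy to regenerate or post-process; a small sketch, assuming the per-operator timings have already been collected into a dict (values copied from the first rows above):

```python
# Sketch of assembling and printing a results table in the format shown
# above; column names mirror the output, values are copied sample rows.
import pandas as pd

results = {
    "sequence_length": [16.0, 32.0, 64.0],
    "ORT-GQA-Dense": [0.781751, 0.893813, 1.434056],
    "ORT-GQA-Dense-PackedQKV": [0.571226, 0.684198, 1.589263],
}
df = pd.DataFrame(results)
print("prompt-sm90-Llama3-8B-b1-h32_8x128-float16:")
print(df)
```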