Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking (#20149)
### Description
This PR adds support for flash attention v2 and INT4 CUDA quantization to the LLaMA end-to-end benchmarking in PyTorch.
### Motivation and Context
The [flash attention v2](https://github.com/Dao-AILab/flash-attention)
algorithm speeds up the attention computation and reduces its memory
footprint, improving model performance in PyTorch. INT4 CUDA
quantization in PyTorch is provided through the
[`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.