onnxruntime
a0ebd5fe - Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking (#20149)

### Description
This PR adds flash attention v2 and support for INT4 CUDA benchmarking in PyTorch.

### Motivation and Context
The [flash attention v2](https://github.com/Dao-AILab/flash-attention) algorithm helps improve model performance in PyTorch. Support for INT4 CUDA in PyTorch is provided through the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.
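As a rough illustration of the flash attention path mentioned above (not the benchmark's actual code): in PyTorch 2.x, `torch.nn.functional.scaled_dot_product_attention` dispatches to a flash-attention-style fused kernel on supported CUDA GPUs, and falls back to a math implementation elsewhere. The tensor shapes below are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder dimensions: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)

# On supported CUDA hardware this call can use a flash attention kernel;
# on CPU it uses the reference math backend. The result is identical in shape.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # -> torch.Size([2, 8, 16, 64])
```

The fused kernel avoids materializing the full attention matrix, which is where the memory and speed gains over naive attention come from.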