Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking (#20149)
### Description
This PR adds support for flash attention v2 and INT4 CUDA quantization to the LLaMA end-to-end benchmarking in PyTorch.
### Motivation and Context
The [flash attention v2](https://github.com/Dao-AILab/flash-attention)
algorithm speeds up the attention computation and reduces its memory
footprint, improving model performance in PyTorch. INT4 CUDA
quantization in PyTorch is provided through the
[`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.