onnxruntime
Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking
#20149
Merged


kunal-vaishnavi Enable flash attention v2 for PyTorch models when benchmarking
0fce15e0
kunal-vaishnavi Add instructions for installing flash attention v2
701d5f3b
kunal-vaishnavi Add INT4 CUDA benchmarking for PyTorch eager
15f0ab6a
kunal-vaishnavi Add instructions for installing PyTorch quantization
3232e42d
kunal-vaishnavi added the release:1.17.3 label
hanbitmyths commented on 2024-03-29
kunal-vaishnavi Use flash attention v2 for CUDA and SDPA for CPU
3e7b79e6
hanbitmyths commented on 2024-03-29
hanbitmyths approved these changes on 2024-03-29
kunal-vaishnavi merged a0ebd5fe into main 2 years ago
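
Commit 3e7b79e6 describes using flash attention v2 on CUDA and SDPA on CPU. A minimal sketch of that kind of selection follows, assuming the Hugging Face `attn_implementation` strings `flash_attention_2` and `sdpa`. The helper name `pick_attn_implementation` is hypothetical and is not taken from the PR's code.

```python
def pick_attn_implementation(device: str) -> str:
    """Return an attention backend name for a device string such as "cuda:0" or "cpu".

    Hypothetical helper for illustration only. The returned strings are values
    accepted by the `attn_implementation` argument of Hugging Face
    `from_pretrained`; the device-to-backend mapping mirrors the commit message
    and is an assumption about how the benchmark script behaves.
    """
    # Flash attention v2 only runs on CUDA devices; fall back to PyTorch SDPA elsewhere.
    if device.startswith("cuda"):
        return "flash_attention_2"
    return "sdpa"
```

The result could then be passed as `attn_implementation=pick_attn_implementation(device)` when loading a PyTorch model with `AutoModelForCausalLM.from_pretrained`. Flash attention v2 additionally needs the `flash-attn` package installed and a half-precision dtype (fp16 or bf16).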

