Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking #20149
Enable flash attention v2 for PyTorch models when benchmarking
0fce15e0
Add instructions for installing flash attention v2
701d5f3b
Add INT4 CUDA benchmarking for PyTorch eager
15f0ab6a
Add instructions for installing PyTorch quantization
3232e42d
Use flash attention v2 for CUDA and SDPA for CPU
3e7b79e6
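The device-dependent choice in the last commit (flash attention v2 on CUDA, SDPA on CPU) can be sketched roughly as below. This is an illustration, not the PR's code: the helper names `pick_attn_implementation` and `build_load_kwargs` are hypothetical, and it assumes the PyTorch model is loaded through Hugging Face Transformers, whose `from_pretrained` accepts an `attn_implementation` argument with the values `"flash_attention_2"` and `"sdpa"`.

```python
def pick_attn_implementation(device: str) -> str:
    """Return an attention backend name for a device string such as 'cuda:0' or 'cpu'.

    Hypothetical helper: mirrors the rule "flash attention v2 for CUDA, SDPA for CPU".
    """
    # Drop an optional device index ('cuda:0' -> 'cuda') and normalize case.
    device_type = device.split(":", 1)[0].lower()
    return "flash_attention_2" if device_type == "cuda" else "sdpa"


def build_load_kwargs(device: str) -> dict:
    """Build keyword arguments for a model-loading call (hypothetical helper).

    Assumption: the kwargs are forwarded to a Hugging Face `from_pretrained` call.
    """
    return {"attn_implementation": pick_attn_implementation(device)}
```

A call would then look something like `AutoModelForCausalLM.from_pretrained(model_name, **build_load_kwargs("cuda"))`. The flash attention v2 path needs the separately installed `flash-attn` package and a half-precision dtype, which is presumably why the PR also adds installation instructions.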