LLaMA Model Optimization #18021
e74b899e  Initial fusions and kernel changes for LLaMA
228de8ca  Add rotary embeddings for LLaMA
dc16e164  Change input shapes and types for fused model
816f7e94  Add present KV to multi-head attention
5ce8e5a6  Merge branch 'main' into kvaishnavi/llama
6669899b  Update benchmark scripts
ed61ae48  Update inputs for optimized model
cdbd4664  Merge branch 'main' into kvaishnavi/llama
becbd302  Add interleaved and non-interleaved rotary embeddings
eece5e82  Update rotary embeddings and export scripts
55d05547  Fix attention mask for HF version
37e6b5fd  Modify rotary embeddings fusion for merged HF model
909f8e76  Add optimization passes after conversion
43f459bb  Fix adding GQA to optimized model
4e2bf415  Add CPU implementation for rotary embeddings
2210c476  Add test cases
6f154e30  Clean up test cases
822c2e60  Fix initializer data in test case
cdf55360  Add merged export
52f59949  Remove logger warning
0d176567  Update docs
bcb5a32d  Enable buffer sharing and int4 quantization
8ae9188c  Fix inputs for buffer sharing
143d8057  Remove extra print
f2b46448  Clean up code
d7bb72c9  Merge branch 'main' into kvaishnavi/llama
8968bb3d  Address PR feedback
84f7cc09  Add changes suggested by linters
99ec3410  Fix min CUDA architecture
b76e2c2b  Add graph input for GQA
edafef50  Fix GQA parity issue
7b829122  Add changes suggested by linter
a8913986  Remove unreferenced parameter
716b7253  Change rotary embedding test threshold
6b8698d4  Add int4 CPU support
cc0199b2  Add changes suggested by linters
e38ecb3b  Merge branch 'main' into kvaishnavi/llama
e69c23b5  Fix linter issue
d14d5bdb  Fix CodeQL error
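Several of the commits above add rotary embedding support, in both interleaved and non-interleaved layouts. The following is a minimal NumPy sketch of the operation itself, not the ONNX Runtime kernel added by this PR; the function name and signature are illustrative:

```python
import numpy as np

def rotary_embedding(x, pos, base=10000.0, interleaved=False):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim).

    interleaved=True rotates adjacent pairs (x0, x1), (x2, x3), ...;
    interleaved=False rotates (x_i, x_{i + head_dim/2}) pairs
    (the GPT-NeoX-style half-split layout).
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per pair of dimensions.
    inv_freq = base ** (-np.arange(half) / half)          # (half,)
    angles = np.outer(pos, inv_freq)                      # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)

    if interleaved:
        x1, x2 = x[:, 0::2], x[:, 1::2]
    else:
        x1, x2 = x[:, :half], x[:, half:]

    # Standard 2-D rotation of each (x1, x2) pair.
    r1 = x1 * cos - x2 * sin
    r2 = x1 * sin + x2 * cos

    out = np.empty_like(x)
    if interleaved:
        out[:, 0::2], out[:, 1::2] = r1, r2
    else:
        out[:, :half], out[:, half:] = r1, r2
    return out
```

Because each pair is only rotated, the per-token norm is unchanged, and position 0 is the identity, which is a handy sanity check when comparing a fused kernel against a reference.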
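The PR also enables int4 weight quantization (bcb5a32d, 6b8698d4). As a rough illustration of the idea, here is a symmetric blockwise int4 scheme sketched in NumPy; this is an assumption-laden toy, not the quantizer ONNX Runtime ships:

```python
import numpy as np

def quantize_int4_blockwise(w, block_size=32):
    """Symmetric blockwise int4 quantization of a 1-D fp32 weight array.

    Each block of `block_size` values shares one fp32 scale; quantized
    values lie in the signed 4-bit range [-8, 7].
    """
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0          # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4_blockwise(q, scales):
    """Recover an fp32 approximation of the original weights."""
    return (q.astype(np.float32) * scales).reshape(-1)
```

With symmetric rounding, the per-element reconstruction error is bounded by half the block's scale, which is why smaller block sizes trade memory for accuracy.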
tianleiwu approved these changes on 2023-10-23.
faxu added the triage:approved label.
faxu added the sdxl_llama label.