llama.cpp

llama : per-layer KV cache #4309

Merged

ggerganov merged 15 commits into master from gg/per-layer-kv
- e9bcf66a slaren: per-layer KV
- 55f2f2fb slaren: remove unnecessary copies
- f4f9367f slaren: less code duplication, offload k and v separately
- c294c78e ggerganov: Merge branch 'master' into per-layer-kv
- 986b3da7 ggerganov: llama : offload KV cache per-layer
- f3dbfb9f ggerganov: llama : offload K shift tensors
- 3d3e6bd0 ggerganov: llama : offload for rest of the model arches

ggerganov added the performance and need feedback labels
slaren commented on 2023-12-03
ggerganov commented on 2023-12-03
- 1fa91a48 ggerganov: llama : enable offload debug temporarily
- c44bc1ee ggerganov: llama : keep the KV related layers on the device
- c80b8a2b ggerganov: llama : remove mirrors, perform Device -> Host when partial offload
- e262947d ggerganov: common : add command-line arg to disable KV cache offloading
- 66aaac98 ggerganov: llama : update session save/load
- 1a1a1c38 ggerganov: llama : support quantum K cache (#4312)

ggerganov added the breaking change label
- 680a99e7 ggerganov: Merge branch 'master' into gg/per-layer-kv
- fc5f3346 ggerganov: readme : add API change notice
ggerganov merged bcc0eb45 into master