llama.cpp
llama : custom attention mask + parallel decoding + no context swaps #3228 (Merged)
ggerganov merged 57 commits into master from custom-attention-mask
c5df72e8 tests : verify that RoPE is "additive"
3b4bab6a llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask)
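This commit is the core of the custom attention mask: instead of ggml_diag_mask_inf, the scaled attention scores get an explicit additive mask (0 for visible positions, -inf for masked ones), so arbitrary causal and per-sequence patterns can be expressed. Below is a minimal sketch of the idea against the public ggml API; the tensor names (KQ_scaled, KQ_mask) follow llama.cpp conventions, but this is illustrative rather than the actual graph-building code.

```cpp
#include "ggml.h"
#include <cstring>
#include <vector>

// Build softmax(KQ_scaled + KQ_mask) with an explicit additive mask instead of
// ggml_diag_mask_inf. The mask is a plain F32 tensor: 0.0f where a query may
// attend, -INFINITY where it may not, so any causal/per-sequence pattern works.
struct ggml_tensor * build_masked_attn(
        struct ggml_context * ctx,
        struct ggml_tensor  * KQ_scaled,           // [n_kv, n_tokens] scaled scores
        const std::vector<float> & mask_values) {  // n_kv * n_tokens entries (0 or -inf)
    struct ggml_tensor * KQ_mask = ggml_new_tensor_2d(
            ctx, GGML_TYPE_F32, KQ_scaled->ne[0], KQ_scaled->ne[1]);
    std::memcpy(KQ_mask->data, mask_values.data(), ggml_nbytes(KQ_mask));

    struct ggml_tensor * KQ_masked = ggml_add(ctx, KQ_scaled, KQ_mask);
    return ggml_soft_max(ctx, KQ_masked);
}
```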
1fb033fd ggml : ggml_rope now takes a vector with positions instead of n_past
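With per-token positions, tokens from different sequences (each with a different history length) can be RoPE'd correctly inside a single batch. A minimal sketch of building such a positions tensor, assuming the standard ggml tensor API; the exact ggml_rope signature introduced here is only indicated in the comment.

```cpp
#include "ggml.h"
#include <cstdint>
#include <cstring>
#include <vector>

// One RoPE position per token in the batch: tokens from different sequences can
// sit next to each other, each continuing from its own history length.
struct ggml_tensor * build_rope_positions(
        struct ggml_context * ctx,
        const std::vector<int32_t> & pos) {
    struct ggml_tensor * KQ_pos = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, (int64_t) pos.size());
    std::memcpy(KQ_pos->data, pos.data(), ggml_nbytes(KQ_pos));
    // KQ_pos is then passed to ggml_rope(...) in place of the old scalar n_past argument
    return KQ_pos;
}
```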
ggerganov force-pushed from d4cd2633 to 1fb033fd (2 years ago)
ggerganov commented on 2023-09-17
fad56936 metal : add rope_f16 kernel + optimize cpy kernels
ggerganov force-pushed from 57cea733 to fad56936 (2 years ago)
d29e7693 llama : unified KV cache + batch inference API
58bb5110 Merge branch 'master' into custom-attention-mask
ggerganov added the labels: high priority, need feedback
9f42e754 llama : add new llama_decode() API that works with llama_batch
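llama_decode() works on a llama_batch in which every token carries its own position and sequence id, so there is no implicit n_past. A minimal sketch of evaluating a prompt with it, assuming the batch layout from this PR (one seq_id per token) and an already-initialized context; error handling is omitted.

```cpp
#include "llama.h"
#include <vector>

// Evaluate a (non-empty) prompt as one llama_batch: explicit per-token positions
// and sequence ids, logits requested only for the last token. The field layout
// follows this PR (one seq_id per token) and is an assumption of this sketch.
int eval_prompt(llama_context * ctx, const std::vector<llama_token> & prompt, llama_seq_id seq) {
    const int32_t n_tokens = (int32_t) prompt.size();

    std::vector<llama_token>  tokens (prompt);
    std::vector<llama_pos>    pos    (n_tokens);
    std::vector<llama_seq_id> seq_ids(n_tokens, seq);
    std::vector<int8_t>       logits (n_tokens, 0);

    for (int32_t i = 0; i < n_tokens; ++i) {
        pos[i] = i;              // explicit positions replace the old n_past
    }
    logits.back() = 1;           // only the last token's logits are needed

    llama_batch batch = {};
    batch.n_tokens = n_tokens;
    batch.token    = tokens.data();
    batch.pos      = pos.data();
    batch.seq_id   = seq_ids.data();
    batch.logits   = logits.data();

    return llama_decode(ctx, batch); // 0 on success
}
```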
6952a460 llama : add cell_max heuristic for more efficient kv_cache
4d76d762 llama : extend llama_kv_cache API
f015b266 llama : more robust cell_max heuristic + wip shift
86c90e34 metal : disable concurrency optimization
0cbf3bfe llama : add llama_kv_cache_shift_seq + no more context swaps
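Shifting the cells of a sequence (relying on the "RoPE is additive" property verified in the first commit) is what removes the old context swap, where half of the context had to be re-evaluated once it filled up. A rough sketch of the pattern; the kv-cache function names and signatures below follow the commit titles and are assumptions, since the API was renamed later in this PR.

```cpp
#include "llama.h"

// Keep generating past n_ctx by discarding the oldest part of a sequence and
// shifting the remaining cache cells back, instead of re-evaluating half of the
// context (the old "context swap"). The two kv-cache calls use names taken from
// this PR's commits and are assumptions; the merged API may spell them differently.
void shift_context(llama_context * ctx, llama_seq_id seq, int n_ctx, int n_keep, int & n_past) {
    if (n_past < n_ctx) {
        return; // still fits in the context window
    }
    const int n_discard = (n_past - n_keep) / 2;

    // drop cells in [n_keep, n_keep + n_discard) ...
    llama_kv_cache_rm_seq   (ctx, seq, n_keep, n_keep + n_discard);
    // ... and shift the rest back by n_discard (K is re-roped thanks to additive RoPE)
    llama_kv_cache_shift_seq(ctx, seq, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}
```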
ggerganov force-pushed from 6289ed6b to 0cbf3bfe (2 years ago)
7c1bdd0e llama : apply K-cache roping for Falcon and Baichuan
1f17ea63 speculative : fix KV cache management
ggerganov force-pushed from 976ff05a to 5bda9e27 (2 years ago)
0161372b parallel : example for serving multiple users in parallel
ggerganov force-pushed from 5bda9e27 to 0161372b (2 years ago)
466b5138 parallel : disable hot-plug to avoid cache fragmentation
897caccd fixes : speculative KV cache + llama worst-case graph
fa0e6778 llama : extend batch API to select which logits to output
daf4c6d3 llama : fix worst case graph build
7e2b9974 ggml-cuda : update rope implementation for parallel decoding (#3254)
25bd2540 make : add parallel to build + fix static functions in llama.cpp
467e3079 simple : fix token counting
36714e16 parallel : various improvements
ddad2277 llama : fix cell_max logic + rename functions
806d397c parallel : try smaller batches when the KV cache is fragmented
16090a5d parallel : fix sequence termination criteria
d37081ae llama : silence errors KV cache errors
ggerganov force-pushed from 7fa23d75 to d37081ae (2 years ago)
82e20e9b parallel : remove new line from prompt
4b5f3cd6 parallel : process system prompt once + configurable paramters + llam…
8a9aca37 parallel : remove question with short answers
eed3fd42 parallel : count cache misses
6028879f parallel : print misses on each request
7b7472ee parallel : minor
e1067efb llama : fix n_kv to never become 0
a1327c71 parallel : rename hot-plug to continuous-batching
addae65f llama : improve llama_batch API + simplify parallel example
ggerganov force-pushed from 464720fc to addae65f (2 years ago)
b377bf22 simple : add parallel decoding support
db0fc2da simple : improve comments + free batch
e04dc519 ggml-cuda : add rope f16, restore performance with parallel decoding …
54206962 llama : disable MPI for now
2f3a46fc train : make KQ_pos memory buffer permanent via dummy scale op
1be2b8c1 ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275)
ggerganov marked this pull request as ready for review (2 years ago)
ee1d670c parallel : fix bug (extra BOS) + smaller token_prev array
ded9b43c parallel : fix cases where the input prompts can overflow the batch
b2debf65 parallel : add disabled experimental batch chunking in powers of two
5a3369d8 llama : llama.h formatting + comments
88451600 simple : add README.md
slaren commented on 2023-09-26
c1596f63 llama : fix kv cache heuristic when context is less than 32
25856900 Merge branch 'master' into custom-attention-mask
4ad06769 parallel : fix crash when `-n -1`
e9463792 llama : simplify returns if/else branches
4c72ab13 metal : use mm kernels for batch size > 2
d008733e examples : utilize new llama_get_logits_ith()
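llama_get_logits_ith() pairs with the per-token logits flag in llama_batch (see the "select which logits to output" commit above): only tokens whose flag was set have logits available after llama_decode(). A minimal sketch of reading them back and taking the argmax; n_vocab is passed in explicitly rather than assuming a particular accessor.

```cpp
#include "llama.h"
#include <algorithm>
#include <iterator>

// Read the logits of batch token i from the last llama_decode() call and pick
// the argmax. Only valid if batch.logits[i] was set when the batch was submitted.
llama_token greedy_token_at(llama_context * ctx, int32_t i, int n_vocab) {
    const float * logits = llama_get_logits_ith(ctx, i);
    const float * best   = std::max_element(logits, logits + n_vocab);
    return (llama_token) std::distance(logits, best);
}
```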
a2075615 examples : add example for batched decoding
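Batched decoding puts one token per in-flight sequence into a single llama_batch, and the unified KV cache keeps the sequences separate through the attention mask. A sketch of one generation step for several sequences, under the same assumed batch layout as the earlier llama_decode() sketch; it is illustrative, not the example code itself.

```cpp
#include "llama.h"
#include <vector>

// One generation step for several sequences in a single llama_decode() call.
// Each sequence contributes its newest token with its own seq_id and position;
// logits are requested for every token so each sequence can be sampled.
int decode_step(llama_context * ctx,
                const std::vector<llama_token> & last_tokens,  // newest token of each sequence
                const std::vector<llama_pos>   & n_past) {     // current length of each sequence
    const int32_t n_seq = (int32_t) last_tokens.size();

    std::vector<llama_token>  tokens (last_tokens);
    std::vector<llama_pos>    pos    (n_seq);
    std::vector<llama_seq_id> seq_ids(n_seq);
    std::vector<int8_t>       logits (n_seq, 1); // request logits for every sequence

    for (int32_t s = 0; s < n_seq; ++s) {
        pos[s]     = n_past[s]; // each sequence continues from its own position
        seq_ids[s] = s;         // sequence ids 0..n_seq-1 (assumed assignment)
    }

    llama_batch batch = {};
    batch.n_tokens = n_seq;
    batch.token    = tokens.data();
    batch.pos      = pos.data();
    batch.seq_id   = seq_ids.data();
    batch.logits   = logits.data();

    return llama_decode(ctx, batch); // 0 on success
}
```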
2b8830af examples : do not eval prompt 2 times (close #3348)
ce2d995a server : clear the KV cache beyond n_past before llama_decode
slaren approved these changes on 2023-09-28
c5650ed4 server : avoid context swaps by shifting the KV cache
ggerganov merged ec893798 into master (2 years ago)
cebtenzzre commented on 2023-10-05
Reviewers: slaren, cebtenzzre, xaedes
Assignees: none
Labels: high priority, need feedback
Milestone: none