llama.cpp
llama : custom attention mask + parallel decoding + no context swaps #3228 (Merged)

ggerganov merged 57 commits into master from custom-attention-mask
ggerganov tests : verify that RoPE is "additive"
c5df72e8
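The "additive" property being verified here, paraphrased: RoPE rotates each dimension pair by an angle proportional to the token position, so rotating for position p1 and then for position p2 is equivalent to a single rotation for p1 + p2:

$$ R_\theta(p_1)\,R_\theta(p_2)\,x = R_\theta(p_1 + p_2)\,x $$

This property is what later allows the K-cache to be "re-roped" by a position delta when the cache is shifted, rather than recomputed from scratch.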
ggerganov llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask)
3b4bab6a
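What this commit changes, in sketch form: instead of the implicit causal mask of ggml_diag_mask_inf driven by n_past, the attention scores get an explicit mask tensor added to them (0 where a KV cell is visible, -inf where it is not), which is what makes per-sequence masking possible. Below is a self-contained illustration of how such a mask can be computed from positions and sequence ids; all names are illustrative, not the PR's:

```c
#include <math.h>
#include <stdio.h>

// Illustrative only: build the kind of custom attention mask this PR adds to
// the KQ scores via ggml_add, replacing ggml_diag_mask_inf. A KV cell j is
// visible to query token i iff both belong to the same sequence and the
// cell's position does not exceed the token's position (causality).
void build_kq_mask(float * mask, int n_kv, int n_tokens,
                   const int * tok_pos,  const int * tok_seq,
                   const int * cell_pos, const int * cell_seq) {
    for (int i = 0; i < n_tokens; ++i) {
        for (int j = 0; j < n_kv; ++j) {
            const int visible = cell_seq[j] == tok_seq[i] && cell_pos[j] <= tok_pos[i];
            mask[i*n_kv + j] = visible ? 0.0f : -INFINITY;
        }
    }
}

int main(void) {
    // two sequences interleaved in a 4-cell cache, one query token per sequence
    const int cell_pos[4] = {0, 0, 1, 1};
    const int cell_seq[4] = {0, 1, 0, 1};
    const int tok_pos[2]  = {2, 2};
    const int tok_seq[2]  = {0, 1};

    float mask[2*4];
    build_kq_mask(mask, 4, 2, tok_pos, tok_seq, cell_pos, cell_seq);

    for (int i = 0; i < 2; ++i, puts("")) {
        for (int j = 0; j < 4; ++j) {
            printf("%6.1f ", mask[i*4 + j]);
        }
    }
    return 0;
}
```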
ggerganov ggml : ggml_rope now takes a vector with positions instead of n_past
1fb033fd
ggerganov force-pushed from d4cd2633 to 1fb033fd
ggerganov commented on 2023-09-17
ggerganov metal : add rope_f16 kernel + optimize cpy kernels
fad56936
ggerganov force-pushed from 57cea733 to fad56936
ggerganov llama : unified KV cache + batch inference API
d29e7693
ggerganov Merge branch 'master' into custom-attention-mask
58bb5110
ggerganov added the high priority and need feedback labels
ggerganov llama : add new llama_decode() API that works with llama_batch
9f42e754
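The new entry point replaces llama_eval-style calls that took tokens plus an n_past counter: a llama_batch carries an explicit position and sequence id for every token, plus a per-token flag selecting which tokens produce logits. A rough usage sketch under assumptions about this era of the API (the llama_batch_init signature and the per-token seq_id layout changed in later versions):

```c
#include "llama.h"
#include <stdio.h>

// Sketch, assuming an initialized context and two sequences that each already
// have n_past[s] tokens in the KV cache. Field names follow this PR's batch
// (token, pos, seq_id, logits, n_tokens); exact signatures are assumptions.
static void decode_two_seqs(struct llama_context * ctx,
                            const llama_token next_token[2],
                            llama_pos n_past[2]) {
    struct llama_batch batch = llama_batch_init(/*n_tokens*/ 2, /*embd*/ 0);

    for (int s = 0; s < 2; ++s) {
        const int i = batch.n_tokens++;
        batch.token [i] = next_token[s]; // token to feed for sequence s
        batch.pos   [i] = n_past[s]++;   // explicit position replaces the old n_past argument
        batch.seq_id[i] = s;             // KV cache sequence this token belongs to
        batch.logits[i] = 1;             // request logits for this token
    }

    // one llama_decode() call evaluates both sequences in a single batch
    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed\n");
    }

    llama_batch_free(batch);
}
```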
ggerganov llama : add cell_max heuristic for more efficient kv_cache
6952a460
ggerganov llama : extend llama_kv_cache API
4d76d762
ggerganov llama : more robust cell_max heuristic + wip shift
f015b266
ggerganov metal : disable concurrency optimization
86c90e34
ggerganov llama : add llama_kv_cache_shift_seq + no more context swaps
0cbf3bfe
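This is the commit that retires the old context swap: when the context fills up, part of the cache is dropped and the remaining entries have their positions shifted (and their K values re-roped) in place, so nothing is re-evaluated. A sketch of the pattern, assuming ctx, seq_id, n_past, and n_keep are in scope:

```c
// Sketch, not verbatim: keep the first n_keep tokens of sequence seq_id, drop
// the next n_discard, and shift the remainder back in place so generation can
// continue without re-evaluating anything. Function names follow the commit
// titles in this PR; exact signatures are assumptions (the shift was later
// renamed, e.g. to llama_kv_cache_seq_shift on master).
const int n_discard = (n_past - n_keep) / 2;

llama_kv_cache_seq_rm   (ctx, seq_id, n_keep,             n_keep + n_discard);
llama_kv_cache_shift_seq(ctx, seq_id, n_keep + n_discard, n_past, -n_discard);

n_past -= n_discard;
```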
ggerganov force-pushed from 6289ed6b to 0cbf3bfe
ggerganov llama : apply K-cache roping for Falcon and Baichuan
7c1bdd0e
ggerganov speculative : fix KV cache management
1f17ea63
ggerganov force-pushed from 976ff05a to 5bda9e27
ggerganov parallel : example for serving multiple users in parallel
0161372b
ggerganov force-pushed from 5bda9e27 to 0161372b
ggerganov parallel : disable hot-plug to avoid cache fragmentation
466b5138
ggerganov fixes : speculative KV cache + llama worst-case graph
897caccd
ggerganov llama : extend batch API to select which logits to output
fa0e6778
ggerganov llama : fix worst case graph build
daf4c6d3
slaren ggml-cuda : update rope implementation for parallel decoding (#3254)
7e2b9974
ggerganov make : add parallel to build + fix static functions in llama.cpp
25bd2540
ggerganov simple : fix token counting
467e3079
ggerganov parallel : various improvements
36714e16
ggerganov llama : fix cell_max logic + rename functions
ddad2277
ggerganov parallel : try smaller batches when the KV cache is fragmented
806d397c
ggerganov parallel : fix sequence termination criteria
16090a5d
ggerganov llama : silence KV cache errors
d37081ae
ggerganov force-pushed from 7fa23d75 to d37081ae
ggerganov parallel : remove new line from prompt
82e20e9b
ggerganov parallel : process system prompt once + configurable parameters + llam…
4b5f3cd6
ggerganov parallel : remove question with short answers
8a9aca37
ggerganov parallel : count cache misses
eed3fd42
ggerganov parallel : print misses on each request
6028879f
ggerganov parallel : minor
7b7472ee
ggerganov llama : fix n_kv to never become 0
e1067efb
ggerganov parallel : rename hot-plug to continuous-batching
a1327c71
ggerganov llama : improve llama_batch API + simplify parallel example
addae65f
ggerganov force-pushed from 464720fc to addae65f
ggerganov simple : add parallel decoding support
b377bf22
ggerganov simple : improve comments + free batch
db0fc2da
slaren ggml-cuda : add rope f16, restore performance with parallel decoding …
e04dc519
ggerganov llama : disable MPI for now
54206962
ggerganov train : make KQ_pos memory buffer permanent via dummy scale op
2f3a46fc
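As I read this commit title, the trick is a no-op applied to the positions tensor so the graph allocator considers its buffer in use and never recycles it between graph builds. A hedged one-liner of the pattern; the era's ggml_scale took the scale factor as a 1-element tensor, so treat the exact call as an assumption:

```c
// Assumed pattern: a dummy multiply-by-1.0 keeps KQ_pos referenced in the
// graph so the allocator does not recycle its buffer between builds.
KQ_pos = ggml_scale(ctx0, KQ_pos, ggml_new_f32(ctx0, 1.0f));
```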
slaren ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275)
1be2b8c1
ggerganov marked this pull request as ready for review
ggerganov parallel : fix bug (extra BOS) + smaller token_prev array
ee1d670c
ggerganov parallel : fix cases where the input prompts can overflow the batch
ded9b43c
ggerganov parallel : add disabled experimental batch chunking in powers of two
b2debf65
ggerganov llama : llama.h formatting + comments
5a3369d8
ggerganov simple : add README.md
88451600
slaren commented on 2023-09-26
ggerganov llama : fix kv cache heuristic when context is less than 32
c1596f63
ggerganov Merge branch 'master' into custom-attention-mask
25856900
ggerganov parallel : fix crash when `-n -1`
4ad06769
ggerganov llama : simplify returns if/else branches
e9463792
ggerganov metal : use mm kernels for batch size > 2
4c72ab13
ggerganov examples : utilize new llama_get_logits_ith()
d008733e
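llama_get_logits_ith(ctx, i) returns the logits row for the i-th token of the last decoded batch, which exists only if that token requested logits. A minimal greedy-pick sketch, assuming n_vocab is in scope and batch entry i had its logits flag set:

```c
// Greedy pick from the logits of batch entry i; valid only if that entry had
// its logits flag set in the last llama_decode() call.
const float * logits = llama_get_logits_ith(ctx, i);

llama_token best = 0;
for (llama_token t = 1; t < n_vocab; ++t) {
    if (logits[t] > logits[best]) {
        best = t;
    }
}
```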
ggerganov examples : add example for batched decoding
a2075615
ggerganov examples : do not eval prompt 2 times (close #3348)
2b8830af
ggerganov server : clear the KV cache beyond n_past before llama_decode
ce2d995a
slaren approved these changes on 2023-09-28
ggerganov server : avoid context swaps by shifting the KV cache
c5650ed4
ggerganov merged ec893798 into master
cebtenzzre commented on 2023-10-05