llama.cpp
llama : custom attention mask + parallel decoding + no context swaps #3228 (Merged)
ggerganov merged 57 commits into master from custom-attention-mask
c5df72e8 tests : verify that RoPE is "additive"
3b4bab6a llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask)
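This commit is the core of the custom attention mask: instead of ggml_diag_mask_inf, the scaled attention scores get an explicit additive mask (0 for visible positions, -inf for masked ones), so arbitrary causal and per-sequence patterns can be expressed. Below is a minimal sketch of the idea against the public ggml API; the tensor names (KQ_scaled, KQ_mask) follow llama.cpp conventions, but this is illustrative rather than the actual graph-building code.

```cpp
#include "ggml.h"
#include <cstring>
#include <vector>

// Build softmax(KQ_scaled + KQ_mask) with an explicit additive mask instead of
// ggml_diag_mask_inf. The mask is a plain F32 tensor: 0.0f where a query may
// attend, -INFINITY where it may not, so any causal/per-sequence pattern works.
struct ggml_tensor * build_masked_attn(
        struct ggml_context * ctx,
        struct ggml_tensor  * KQ_scaled,           // [n_kv, n_tokens] scaled scores
        const std::vector<float> & mask_values) {  // n_kv * n_tokens entries (0 or -inf)
    struct ggml_tensor * KQ_mask = ggml_new_tensor_2d(
            ctx, GGML_TYPE_F32, KQ_scaled->ne[0], KQ_scaled->ne[1]);
    std::memcpy(KQ_mask->data, mask_values.data(), ggml_nbytes(KQ_mask));

    struct ggml_tensor * KQ_masked = ggml_add(ctx, KQ_scaled, KQ_mask);
    return ggml_soft_max(ctx, KQ_masked);
}
```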
1fb033fd ggml : ggml_rope now takes a vector with positions instead of n_past
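With per-token positions, tokens from different sequences (each with a different history length) can be RoPE'd correctly inside a single batch. A minimal sketch of building such a positions tensor, assuming the standard ggml tensor API; the exact ggml_rope signature introduced here is only indicated in the comment.

```cpp
#include "ggml.h"
#include <cstdint>
#include <cstring>
#include <vector>

// One RoPE position per token in the batch: tokens from different sequences can
// sit next to each other, each continuing from its own history length.
struct ggml_tensor * build_rope_positions(
        struct ggml_context * ctx,
        const std::vector<int32_t> & pos) {
    struct ggml_tensor * KQ_pos = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, (int64_t) pos.size());
    std::memcpy(KQ_pos->data, pos.data(), ggml_nbytes(KQ_pos));
    // KQ_pos is then passed to ggml_rope(...) in place of the old scalar n_past argument
    return KQ_pos;
}
```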
ggerganov force-pushed from d4cd2633 to 1fb033fd (2 years ago)
ggerganov commented on 2023-09-17
fad56936 metal : add rope_f16 kernel + optimize cpy kernels
ggerganov force-pushed from 57cea733 to fad56936 (2 years ago)
d29e7693 llama : unified KV cache + batch inference API
58bb5110 Merge branch 'master' into custom-attention-mask
ggerganov added the labels: high priority, need feedback
9f42e754 llama : add new llama_decode() API that works with llama_batch
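llama_decode() works on a llama_batch in which every token carries its own position and sequence id, so there is no implicit n_past. A minimal sketch of evaluating a prompt with it, assuming the batch layout from this PR (one seq_id per token) and an already-initialized context; error handling is omitted.

```cpp
#include "llama.h"
#include <vector>

// Evaluate a (non-empty) prompt as one llama_batch: explicit per-token positions
// and sequence ids, logits requested only for the last token. The field layout
// follows this PR (one seq_id per token) and is an assumption of this sketch.
int eval_prompt(llama_context * ctx, const std::vector<llama_token> & prompt, llama_seq_id seq) {
    const int32_t n_tokens = (int32_t) prompt.size();

    std::vector<llama_token>  tokens (prompt);
    std::vector<llama_pos>    pos    (n_tokens);
    std::vector<llama_seq_id> seq_ids(n_tokens, seq);
    std::vector<int8_t>       logits (n_tokens, 0);

    for (int32_t i = 0; i < n_tokens; ++i) {
        pos[i] = i;              // explicit positions replace the old n_past
    }
    logits.back() = 1;           // only the last token's logits are needed

    llama_batch batch = {};
    batch.n_tokens = n_tokens;
    batch.token    = tokens.data();
    batch.pos      = pos.data();
    batch.seq_id   = seq_ids.data();
    batch.logits   = logits.data();

    return llama_decode(ctx, batch); // 0 on success
}
```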
6952a460 llama : add cell_max heuristic for more efficient kv_cache
4d76d762 llama : extend llama_kv_cache API
f015b266 llama : more robust cell_max heuristic + wip shift
86c90e34 metal : disable concurrency optimization
0cbf3bfe llama : add llama_kv_cache_shift_seq + no more context swaps
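Shifting the cells of a sequence (relying on the "RoPE is additive" property verified in the first commit) is what removes the old context swap, where half of the context had to be re-evaluated once it filled up. A rough sketch of the pattern; the kv-cache function names and signatures below follow the commit titles and are assumptions, since the API was renamed later in this PR.

```cpp
#include "llama.h"

// Keep generating past n_ctx by discarding the oldest part of a sequence and
// shifting the remaining cache cells back, instead of re-evaluating half of the
// context (the old "context swap"). The two kv-cache calls use names taken from
// this PR's commits and are assumptions; the merged API may spell them differently.
void shift_context(llama_context * ctx, llama_seq_id seq, int n_ctx, int n_keep, int & n_past) {
    if (n_past < n_ctx) {
        return; // still fits in the context window
    }
    const int n_discard = (n_past - n_keep) / 2;

    // drop cells in [n_keep, n_keep + n_discard) ...
    llama_kv_cache_rm_seq   (ctx, seq, n_keep, n_keep + n_discard);
    // ... and shift the rest back by n_discard (K is re-roped thanks to additive RoPE)
    llama_kv_cache_shift_seq(ctx, seq, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}
```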
ggerganov force-pushed from 6289ed6b to 0cbf3bfe (2 years ago)
7c1bdd0e llama : apply K-cache roping for Falcon and Baichuan
1f17ea63 speculative : fix KV cache management
ggerganov force-pushed from 976ff05a to 5bda9e27 (2 years ago)
0161372b parallel : example for serving multiple users in parallel
ggerganov force-pushed from 5bda9e27 to 0161372b (2 years ago)
466b5138 parallel : disable hot-plug to avoid cache fragmentation
897caccd fixes : speculative KV cache + llama worst-case graph
fa0e6778 llama : extend batch API to select which logits to output
daf4c6d3 llama : fix worst case graph build
7e2b9974 ggml-cuda : update rope implementation for parallel decoding (#3254)
25bd2540 make : add parallel to build + fix static functions in llama.cpp
467e3079 simple : fix token counting
36714e16 parallel : various improvements
ddad2277 llama : fix cell_max logic + rename functions
806d397c parallel : try smaller batches when the KV cache is fragmented
16090a5d parallel : fix sequence termination criteria
d37081ae llama : silence errors KV cache errors
ggerganov force-pushed from 7fa23d75 to d37081ae (2 years ago)
82e20e9b parallel : remove new line from prompt
4b5f3cd6 parallel : process system prompt once + configurable paramters + llam…
8a9aca37 parallel : remove question with short answers
eed3fd42 parallel : count cache misses
6028879f parallel : print misses on each request
7b7472ee parallel : minor
e1067efb llama : fix n_kv to never become 0
a1327c71 parallel : rename hot-plug to continuous-batching
addae65f llama : improve llama_batch API + simplify parallel example
ggerganov force-pushed from 464720fc to addae65f (2 years ago)
b377bf22 simple : add parallel decoding support
db0fc2da simple : improve comments + free batch
e04dc519 ggml-cuda : add rope f16, restore performance with parallel decoding …
54206962 llama : disable MPI for now
2f3a46fc train : make KQ_pos memory buffer permanent via dummy scale op
1be2b8c1 ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275)
ggerganov marked this pull request as ready for review (2 years ago)
ee1d670c parallel : fix bug (extra BOS) + smaller token_prev array
ded9b43c parallel : fix cases where the input prompts can overflow the batch
b2debf65 parallel : add disabled experimental batch chunking in powers of two
5a3369d8 llama : llama.h formatting + comments
88451600 simple : add README.md
slaren commented on 2023-09-26
c1596f63 llama : fix kv cache heuristic when context is less than 32
25856900 Merge branch 'master' into custom-attention-mask
4ad06769 parallel : fix crash when `-n -1`
e9463792 llama : simplify returns if/else branches
4c72ab13 metal : use mm kernels for batch size > 2
d008733e examples : utilize new llama_get_logits_ith()
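llama_get_logits_ith() pairs with the per-token logits flag in llama_batch (see the "select which logits to output" commit above): only tokens whose flag was set have logits available after llama_decode(). A minimal sketch of reading them back and taking the argmax; n_vocab is passed in explicitly rather than assuming a particular accessor.

```cpp
#include "llama.h"
#include <algorithm>
#include <iterator>

// Read the logits of batch token i from the last llama_decode() call and pick
// the argmax. Only valid if batch.logits[i] was set when the batch was submitted.
llama_token greedy_token_at(llama_context * ctx, int32_t i, int n_vocab) {
    const float * logits = llama_get_logits_ith(ctx, i);
    const float * best   = std::max_element(logits, logits + n_vocab);
    return (llama_token) std::distance(logits, best);
}
```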
a2075615 examples : add example for batched decoding
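Batched decoding puts one token per in-flight sequence into a single llama_batch, and the unified KV cache keeps the sequences separate through the attention mask. A sketch of one generation step for several sequences, under the same assumed batch layout as the earlier llama_decode() sketch; it is illustrative, not the example code itself.

```cpp
#include "llama.h"
#include <vector>

// One generation step for several sequences in a single llama_decode() call.
// Each sequence contributes its newest token with its own seq_id and position;
// logits are requested for every token so each sequence can be sampled.
int decode_step(llama_context * ctx,
                const std::vector<llama_token> & last_tokens,  // newest token of each sequence
                const std::vector<llama_pos>   & n_past) {     // current length of each sequence
    const int32_t n_seq = (int32_t) last_tokens.size();

    std::vector<llama_token>  tokens (last_tokens);
    std::vector<llama_pos>    pos    (n_seq);
    std::vector<llama_seq_id> seq_ids(n_seq);
    std::vector<int8_t>       logits (n_seq, 1); // request logits for every sequence

    for (int32_t s = 0; s < n_seq; ++s) {
        pos[s]     = n_past[s]; // each sequence continues from its own position
        seq_ids[s] = s;         // sequence ids 0..n_seq-1 (assumed assignment)
    }

    llama_batch batch = {};
    batch.n_tokens = n_seq;
    batch.token    = tokens.data();
    batch.pos      = pos.data();
    batch.seq_id   = seq_ids.data();
    batch.logits   = logits.data();

    return llama_decode(ctx, batch); // 0 on success
}
```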
2b8830af examples : do not eval prompt 2 times (close #3348)
ce2d995a server : clear the KV cache beyond n_past before llama_decode
slaren approved these changes on 2023-09-28
c5650ed4 server : avoid context swaps by shifting the KV cache
ggerganov merged ec893798 into master (2 years ago)
cebtenzzre commented on 2023-10-05
Reviewers: slaren, cebtenzzre, xaedes
Assignees: none
Labels: high priority, need feedback
Milestone: none