Just came across another RoPE adjustment method on Reddit. Thought it might be helpful, so here's the link!
Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning
This still means I will get better perplexity performance when using this PR with @TheBloke's superhot model variants, I guess?
Somewhat exciting?
I use my merged 13b model (wizard vicuña + starcoder + superhot 16k) with that 16k command.
Looks reasonable.
And what if I want to test 32k or higher? How do I set both parameters? Any ideas?
The --rope-freq-scale is the same scale used in "superhot/rope interpolation". The superhot 8k lora corresponds to --rope-freq-scale 0.25 -c 8192, which is a factor of 4 increase. Similarly, the superhot 16k lora and longchat 16k correspond to --rope-freq-scale 0.125 -c 16384, for a factor of 8.
The --rope-freq-base simplifies the "NTK-Aware Scaled RoPE". The base number here corresponds to 10000*alpha**(64/63), using the alpha introduced in the reddit post. I'm not aware of any direct translation between context length and the base (or alpha). My limited testing with 13B models shows a rough quadratic correspondence, C = -0.0652*b*b + 0.862*b + 0.203, for C the factor of context length increase and b the factor of base increase, roughly:
base | effective ctx factor | effective ctx |
---|---|---|
20000 | 1.66 | 3400 |
26000 | 2 | 4096 |
40000 | 2.6 | 5300 |
57200 | 3.0 | 6144 |
I found base > 60000 didn't feel good, though I've no hard numbers to back this up.
Empirically, without fine-tuning, you could try:
-c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000
-c 6144 --rope-freq-scale 0.86 --rope-freq-base 40000
-c 8192 --rope-freq-scale 0.75 --rope-freq-base 57200
With superhot 16k or longchat 13b, perhaps you could try (KV cache alone requires 25GB!!)
-c 32768 --rope-freq-scale 0.125 --rope-freq-base 26000
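For what it's worth, here is a minimal sketch (mine, not part of the PR) that reproduces the table above from the rough quadratic fit, assuming a 2048-token training context:

```cpp
// Sketch only: effective context from the rough quadratic fit above,
//   C = -0.0652*b*b + 0.862*b + 0.203
// where b is the base divided by the default 10000 and C multiplies the
// assumed 2048-token training context.
#include <cstdio>

int main() {
    const float bases[] = { 20000.0f, 26000.0f, 40000.0f, 57200.0f };
    for (const float base : bases) {
        const float b = base / 10000.0f;                   // factor of base increase
        const float C = -0.0652f*b*b + 0.862f*b + 0.203f;  // factor of context increase
        std::printf("base %6.0f -> ctx factor %.2f -> ~%.0f tokens\n", base, C, C * 2048.0f);
    }
    return 0;
}
```

Incidentally, the suggested settings above look like they combine both knobs: the --rope-freq-scale values are roughly the factor obtained from the base divided by the target factor (1.66/2 ≈ 0.83, 2.6/3 ≈ 0.87, 3.0/4 = 0.75), though that is just my reading of the numbers.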
I used some numbers posted by @JohannesGaessler, and made changes in scratch0 size in this PR. I can rebase this PR on their PR #2056 if needed.
I think the numbers that I determined for the VRAM scratch buffer will probably work for the RAM scratch buffer, but I would still advise you to be cautious since the two types of scratch buffer work differently (the VRAM scratch buffer has tighter limits).
I tried the -c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000 configuration with the wizardlm-33b-v1.0-uncensored.ggmlv3.q5_K_M.bin model and got ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 490733568, available 482344960), even with 0 layers offloaded.
guanaco-65B.ggmlv3.q4_K_M.bin works with those settings.
I can't reproduce. Is the CUDA build different? I can't test a CUDA build. The number 482344960 seems to be computed from { MODEL_30B, ((size_t) n_ctx / 10ull + 256ull) * MB } with n_ctx = 2048: (2048/10 + 256) * 1 MiB = 460 MiB = 482344960 bytes. Do you use main or another method/executable to run it?
I am trying this out, and it is working fine for me so far, though I've only tried:
model: gpt4-alpaca-lora-30b.ggmlv3.q4_0.bin
context: -c 6144 --rope-freq-scale 0.86 --rope-freq-base 40000
I'll keep trying other models and context sizes and see how it goes. Not sure if other fixes are included, but this seems to make inference way faster (fewer long pauses for CPU-related tasks) on my machine as well. Possibly just due to not having to recompute the context as I hit the 2048-token mark so often?
EDIT: Also working just fine for me:
model: gpt4-alpaca-lora-30b.ggmlv3.q4_0.bin
context: -c 8192 --rope-freq-scale 0.75 --rope-freq-base 57200
I can't reproduce. Is the CUDA build different? [...] Do you use main or another method/executable to run it?
Yes, it's a CUDA build for a 1080ti and I really should have used main for reporting, but didn't. I'll try main tomorrow.
If you could give me a stack trace of when MEM_REQ_SCRATCH0 is called, I could try to figure out what is wrong with the CUDA build. Otherwise, I'll see if I can get a system somewhere with cuda.
Can't reproduce the error today, no idea what I did exactly to trigger it...
This looks great, but similar to #1967 - let's wait for a while before merging.
There are new RoPE scaling techniques popping up by the hour, each one better than the last. No reason to commit to something just yet.
Or dial up the base more?
Perplexity test at 16k with my merged 13b vicuña model (wizard vicuña + starcoder lora + gpt4tools + 16k superhot). The chunk count decreases to 20 at 16k.
base | scale | perplexity [1] |
---|---|---|
70000 | 0.4 | 5.5564 |
57200 | 0.5 | 6.7699 |
68000 | 0.5 | 5.3758 |
70000 | 0.5 | 5.3508 |
75000 | 0.5 | 5.3529 |
76000 | 0.5 | 5.3532 |
78000 | 0.5 | 5.3573 |
80000 | 0.5 | 5.3710 |
84000 | 0.5 | 5.4351 |
100000 | 0.5 | 5.6484 |
120000 | 0.5 | 5.7999 |
The number of chunks decreases as the context is enlarged, which might cause some perplexity problems, but obviously not here. At 20k the chunk count decreases to 16.
20k
base | scale | perplexity [1] |
---|---|---|
68000 | 0.4 | 5.7306 |
70000 | 0.4 | 5.7083 |
72000 | 0.4 | 5.7550 |
11000 | 0.4 | 6.2897 |
150000 | 0.4 | 6.6441 |
100000 | 0.5 | 5.7545 |
110000 | 0.5 | 5.7393 |
120000 | 0.5 | 5.8566 |
32k
I believe 13b MEM_REQ_EVAL is not enough to test🤷
Running perplexity with openllama 3b, -c 16384, scale 0.5, base 90000:
not enough space in the context's memory pool (needed 543758112, available 536870912)
13b, -c 32768, scale 0.25, base 120000:
segmentation fault (needed 108054424, available 1073741824)
KV cache alone requires 25GB!!
Could we quantize the KV cache?
Another solution: #1955.
Btw I just saw falcon v split.
KV cache alone requires 25GB!!
Could we quantize the KV cache?
I think this was tried, and it resulted in bad results. It should already be in f16, but I don't remember if we tried 8-bit quantization...
EDIT: do we use FlashAttention for the forward pass?
For some reason, the server example outputs some random unicode characters when using --rope-freq-scale 0.25 -c 8192, but --rope-freq-scale 0.25 -c 4096 works correctly, and --rope-freq-scale 0.25 -c 8192 works on the CLI.
server gives me 413 when the json data is large. We need help from those who contributed server code.
I believe CPPHTTPLIB_RECV_BUFSIZ needs to be increased; right now it is 4K.
Yeah, SlyEcho is right based on what I saw in the lib; setting #define CPPHTTPLIB_RECV_BUFSIZ size_t(<SOME NUMBER HERE>) before the httplib.h include should be the correct way to increase it, I believe.
Running perplexity with openllama 3b, -c 16384, scale 0.5, base 90000:
not enough space in the context's memory pool (needed 543758112, available 536870912)
The only thing that I know of allocating 512 MB (536870912) is from MEM_REQ_EVAL, which this PR didn't change. Maybe try changing the line { MODEL_3B, 512ull * MB }, to something like { MODEL_3B, 600ull * MB }, and see if it helps?
I believe CPPHTTPLIB_RECV_BUFSIZ needs to be increased; right now it is 4K.
It looks like a simple read buffer to me, and it's separate from the overall size limit.
Server::set_payload_max_length(uint64_t length) might be what we're after then. svr.set_payload_max_length(1024 * 1024 * 1); would set it to 1MB (left the 1 in for example purposes).
The default is actually #define CPPHTTPLIB_PAYLOAD_MAX_LENGTH ((std::numeric_limits<size_t>::max)()). The issue with the server is just that the curl commands don't have the proper content type specified, and that triggers a limit in httplib.h, CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192.
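For reference, a minimal sketch of the knobs discussed above; the values are placeholders, and I'm assuming both macros can be overridden by defining them before the include:

```cpp
// Sketch only: the httplib knobs mentioned above, with placeholder values.
#define CPPHTTPLIB_RECV_BUFSIZ size_t(65536u)                          // read buffer (default is 4K)
#define CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH (1024u * 1024u) // default 8192, the limit hit when the content type is missing
#include "httplib.h"

int main() {
    httplib::Server svr;
    svr.set_payload_max_length(1024 * 1024 * 1); // overall request body cap, 1MB here (default is SIZE_MAX)
    // ... register handlers and call svr.listen(...) as the server example does ...
    return 0;
}
```

That said, per the comment above, the cleaner fix is on the client side: send the proper Content-Type (application/json) so the form-url-encoded limit never applies.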
It'd probably be a good idea to investigate the changes made in this PR. See also the author's comment on the transformers PR.
This comment from the transformers PR helped me understand what this new NTK-By-Parts/NTKv2 method is about.
AFAIK, LLongMAv2 is a finetune of LLaMA using NTKv2 instead of NTKv1 linear RoPE scaling (NTKv1 cannot be fine-tuned as it does not converge AFAIK).
I converted and quantized LLongMA-3b to ggml: huggingface.co/SlyEcho/LLongMA-3b-ggml.
It does seem to be working with long contexts using -c 8192 --rope-freq-scale 0.25.
It'd probably be a good idea to investigate the changes made in this PR. See also the author's comment on the transformers PR.
This comment from the transformers PR helped me understand what this new LLongMAv2/NTK-By-Parts/NTKv2 method is about (they're all the same thing).
Somebody please explain the python code in equations.
What is the latest state of this approach - is it worth merging and supporting?
I've been using this on a Mac M1 Max since the PR was raised and it's working fine for me. I've been hoping it will get merged so I can go back to compiling from master again. Really enjoying having 8k context.
Let's merge and maybe then improve later.
  if (src0->grad) {
      assert(src1->type == GGML_TYPE_I32);
-     assert(ggml_nelements(src1) == 4);
+     assert(ggml_nelements(src1) == 3);
Shouldn't this be 6? Based on the code immediately after it should be at least 4, I think, not 3.
I went through the code and I also can't see why it's 3 when the lines just below it show it clearly taking 4 elements; it looks like it's designed to fail the assertion.
Should be fixed in 513f861
Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 rope frequency and scale?
https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L2955
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L12418
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L12517
Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 rope frequency and scale?
For the ggml.c lines, those appear to be the RoPE backward passes, confusingly named forward_rope_back.
I left the backward code untouched because I wasn't sure how I could correctly modify it and test it. I'm also not sure about the CUDA bits.
The CUDA part is broken right now, it should be fixed.
How do I implement this with RoPE and without it with current LLMs?
You can read a bit more about RoPE use in llama.cpp in llama.cpp/examples/main/README.md.
Though I would recommend you try out the new Self-Extend support added in #4815, which I think is better, as you don't need to retrain the model to get better results.
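For example, if I remember main's flags correctly, something like the following enables Self-Extend (the model name is made up and the group-attention values are only illustrative; the right values depend on the model's training context):
./main -m your-model.gguf -c 8192 --grp-attn-n 4 --grp-attn-w 1024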
Thanks @abc-nix!
What about the implementation of customized RoPE?
Sorry, @bilal-aamer, I am not sure what you are trying to ask here.
This PR adds customized RoPE support. Later, YaRN RoPE scaling was added in PR #2268, and some other fixes were added after that.
main's help has this to say about the options and parameters for RoPE/YaRN:
--rope-scaling {none,linear,yarn}
RoPE frequency scaling method, defaults to linear unless specified by the model
--rope-scale N RoPE context scaling factor, expands context by a factor of N
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training context size)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: 1.0)
--yarn-beta-fast N YaRN: low correction dim or beta (default: 32.0)
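For illustration only (the model path is made up), a model trained at 4096 could be run at 8192 with linear scaling like this:
./main -m models/my-model.gguf -c 8192 --rope-scaling linear --rope-scale 2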
I am not sure what you are trying to achieve or what exactly you are asking. Hopefully someone else isn't as obtuse as me and can help you out.
Is there any documentation on how to implement this, or an example? I am kind of new to the field; I am fine-tuning Code Llama 2 and want to increase the context length, but between all these posts I am somewhat confused about how to actually implement it.
This is my implementation:
accelerate launch --config_file "./fsdp_config.yaml" fsdp_acc2.py --rope_scaling 0.25
and this is the error I am getting:
RuntimeError: The size of tensor a (16384) must match the size of tensor b (16385) at non-singleton dimension
The original RoPE has pre-defined parameters:
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with defaults that match the original:
scale = 1.0
base = 10000
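As a minimal illustration of the formula above (this is my own sketch, not the actual ggml implementation; the head dimension and position are arbitrary example values):

```cpp
// Sketch only: the customized RoPE frequencies described above,
//   theta_i = scale * base^(-2(i-1)/d),  i = 1 .. d/2
// The rotation angle applied to the i-th pair of dimensions at token
// position p is then p * theta_i.
#include <cmath>
#include <cstdio>

int main() {
    const int   d     = 128;      // head dimension (example value)
    const float base  = 10000.0f; // --rope-freq-base (default)
    const float scale = 1.0f;     // --rope-freq-scale (default)
    const int   p     = 42;       // example token position

    for (int i = 1; i <= 4; ++i) { // print the first few of the d/2 frequencies
        const float theta = scale * std::pow(base, -2.0f * (i - 1) / d);
        std::printf("i=%d  theta=%.8f  angle at p=%d: %.6f\n", i, theta, p, p * theta);
    }
    return 0;
}
```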
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameters.
Recent research shows that changing these two parameters extends the context limit with minimal loss:
Extending Context to 8K kaiokendev https://kaiokendev.github.io/til#extending-context-to-8k
Extending Context Window of Large Language Models via Positional Interpolation Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian https://arxiv.org/abs/2306.15595
NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/user/bloc97 https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5