Just came across another RoPE adjustment method on Reddit. Thought it might be helpful, so here's the link!
Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning
This still means I will get better perplexity performance when using this PR with @TheBloke's superhot model variants, I guess?
Somewhat exciting?
I use my merged 13b model (wizard vicuña + starcoder + superhot 16k) with that 16k command.
Looks reasonable.
And what if I want to test 32k or higher? How do I set both parameters? Any ideas?
The --rope-freq-scale is the same scale used in "superhot/rope interpolation". The superhot 8k lora corresponds to --rope-freq-scale 0.25 -c 8192, which is a factor of 4 increase. Similarly, the superhot 16k lora and longchat 16k correspond to --rope-freq-scale 0.125 -c 16384, for a factor of 8.
The --rope-freq-base simplifies the "NTK-Aware Scaled RoPE". The base number here corresponds to 10000*alpha**(64/63), using the alpha introduced in the reddit post. I'm not aware of any direct translation between context length and the base (or alpha). My limited testing with 13B models shows a rough quadratic correspondence, C = -0.0652*b*b + 0.862*b + 0.203, for C the factor of context length increase and b the factor of base increase, roughly:
base | effective ctx factor | effective ctx |
---|---|---|
20000 | 1.66 | 3400 |
26000 | 2 | 4096 |
40000 | 2.6 | 5300 |
57200 | 3.0 | 6144 |
I found base > 60000 didn't feel good, though I've no hard numbers to back this up.
Empirically, without fine-tuning, you could try:
-c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000
-c 6144 --rope-freq-scale 0.86 --rope-freq-base 40000
-c 8192 --rope-freq-scale 0.75 --rope-freq-base 57200
With superhot 16k or longchat 13b, perhaps you could try (KV cache alone requires 25GB!!)
-c 32768 --rope-freq-scale 0.125 --rope-freq-base 26000
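For what it's worth, here is a minimal sketch (mine, not part of the PR) that reproduces the table above from the rough quadratic fit, assuming a 2048-token training context:

```cpp
// Sketch only: effective context from the rough quadratic fit above,
//   C = -0.0652*b*b + 0.862*b + 0.203
// where b is the base divided by the default 10000 and C multiplies the
// assumed 2048-token training context.
#include <cstdio>

int main() {
    const float bases[] = { 20000.0f, 26000.0f, 40000.0f, 57200.0f };
    for (const float base : bases) {
        const float b = base / 10000.0f;                   // factor of base increase
        const float C = -0.0652f*b*b + 0.862f*b + 0.203f;  // factor of context increase
        std::printf("base %6.0f -> ctx factor %.2f -> ~%.0f tokens\n", base, C, C * 2048.0f);
    }
    return 0;
}
```

Incidentally, the suggested settings above look like they combine both knobs: the --rope-freq-scale values are roughly the factor obtained from the base divided by the target factor (1.66/2 ≈ 0.83, 2.6/3 ≈ 0.87, 3.0/4 = 0.75), though that is just my reading of the numbers.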
I used some numbers posted by @JohannesGaessler, and made changes in scratch0 size in this PR. I can rebase this PR on their PR #2056 if needed.
I think the numbers that I determined for the VRAM scratch buffer will probably work for the RAM scratch buffer, but I would still advise you to be cautious since the two types of scratch buffer work differently (the VRAM scratch buffer has tighter limits).
I tried the -c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000 configuration with the wizardlm-33b-v1.0-uncensored.ggmlv3.q5_K_M.bin model and got ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 490733568, available 482344960), even with 0 layers offloaded.
guanaco-65B.ggmlv3.q4_K_M.bin works with those settings.
I can't reproduce. Is the CUDA build different? I can't test a CUDA build. The number 482344960 seems to be computed from { MODEL_30B, ((size_t) n_ctx / 10ull + 256ull) * MB } with n_ctx = 2048: (2048/10 + 256) * 1 MiB = 460 MiB = 482344960 bytes. Do you use main or another method/executable to run it?
I am trying this out, and it is working fine for me so far, though I've only tried:
model: gpt4-alpaca-lora-30b.ggmlv3.q4_0.bin
context: -c 6144 --rope-freq-scale 0.86 --rope-freq-base 40000
I'll keep trying other models and context sizes and see how it goes. Not sure if other fixes are included, but this seems to make inference way faster (fewer long pauses for CPU-related tasks) on my machine as well. Possibly just due to not having to recompute the context as I hit the 2048-token mark so often?
EDIT: Also working just fine for me:
model: gpt4-alpaca-lora-30b.ggmlv3.q4_0.bin
context: -c 8192 --rope-freq-scale 0.75 --rope-freq-base 57200
I can't reproduce. Is the CUDA build different? [...] Do you use main or another method/executable to run it?
Yes, it's a CUDA build for a 1080ti and I really should have used main for reporting, but didn't. I'll try main tomorrow.
If you could give me a stack trace of when MEM_REQ_SCRATCH0 is called, I could try to figure out what is wrong with the CUDA build. Otherwise, I'll see if I can get a system somewhere with cuda.
Can't reproduce the error today, no idea what I did exactly to trigger it...
This looks great, but similar to #1967 - let's wait for a while before merging.
There are new RoPE scaling techniques popping up by the hour, each one better than the last. No reason to commit to something just yet.
Or dial up the base more?
Perplexity test at 16k with my merged 13b vicuña model (wizard vicuña + starcoder lora + gpt4tools + 16k superhot). The chunk count decreases to 20 at 16k.
base | scale | perplexity [1] |
---|---|---|
70000 | 0.4 | 5.5564 |
57200 | 0.5 | 6.7699 |
68000 | 0.5 | 5.3758 |
70000 | 0.5 | 5.3508 |
75000 | 0.5 | 5.3529 |
76000 | 0.5 | 5.3532 |
78000 | 0.5 | 5.3573 |
80000 | 0.5 | 5.3710 |
84000 | 0.5 | 5.4351 |
100000 | 0.5 | 5.6484 |
120000 | 0.5 | 5.7999 |
The number of chunks decreases as the context is enlarged, which might cause some perplexity problems, but obviously not here. At 20k the chunk count decreases to 16.
20k
base | scale | perplexity [1] |
---|---|---|
68000 | 0.4 | 5.7306 |
70000 | 0.4 | 5.7083 |
72000 | 0.4 | 5.7550 |
11000 | 0.4 | 6.2897 |
150000 | 0.4 | 6.6441 |
100000 | 0.5 | 5.7545 |
110000 | 0.5 | 5.7393 |
120000 | 0.5 | 5.8566 |
32k
I believe 13b MEM_REQ_EVAL is not enough to test🤷
Running perplexity with openllama 3b, -c 16384, scale 0.5, base 90000:
not enough space in the context's memory pool (needed 543758112, available 536870912)
13b, -c 32768, scale 0.25, base 120000:
segmentation fault (needed 108054424, available 1073741824)
KV cache alone requires 25GB!!
Could we quantize the KV cache?
Another solution: #1955.
Btw I just saw falcon v split.
KV cache alone requires 25GB!!
Could we quantize the KV cache?
I think this was tried, and it resulted in bad results. It should already be in f16, but I don't remember if we tried 8-bit quantization...
EDIT: do we use FlashAttention for the forward pass?
For some reason, the server example outputs some random unicode characters when using --rope-freq-scale 0.25 -c 8192, but --rope-freq-scale 0.25 -c 4096 works correctly, and --rope-freq-scale 0.25 -c 8192 works on the CLI.
server gives me 413 when the json data is large. We need help from those who contributed server code.
I believe CPPHTTPLIB_RECV_BUFSIZ needs to be increased; right now it is 4K.
Yeah, SlyEcho is right based on what I saw in the lib; setting #define CPPHTTPLIB_RECV_BUFSIZ size_t(<SOME NUMBER HERE>) before the httplib.h include should be the correct way to increase it, I believe.
Running perplexity with openllama 3b, -c 16384, scale 0.5, base 90000:
not enough space in the context's memory pool (needed 543758112, available 536870912)
The only thing that I know of allocating 512 MB (536870912) is from MEM_REQ_EVAL, which this PR didn't change. Maybe try changing the line { MODEL_3B, 512ull * MB }, to something like { MODEL_3B, 600ull * MB }, and see if it helps?
I believe CPPHTTPLIB_RECV_BUFSIZ needs to be increased; right now it is 4K.
It looks like a simple read buffer to me, and it's separate from the overall size limit.
Server::set_payload_max_length(uint64_t length) might be what we're after then. svr.set_payload_max_length(1024 * 1024 * 1); would set it to 1MB (left the 1 in for example purposes).
The default is actually #define CPPHTTPLIB_PAYLOAD_MAX_LENGTH ((std::numeric_limits<size_t>::max)()). The issue with the server is just that the curl commands don't have the proper content type specified, and that triggers a limit in httplib.h, CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192.
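For reference, a minimal sketch of the knobs discussed above; the values are placeholders, and I'm assuming both macros can be overridden by defining them before the include:

```cpp
// Sketch only: the httplib knobs mentioned above, with placeholder values.
#define CPPHTTPLIB_RECV_BUFSIZ size_t(65536u)                          // read buffer (default is 4K)
#define CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH (1024u * 1024u) // default 8192, the limit hit when the content type is missing
#include "httplib.h"

int main() {
    httplib::Server svr;
    svr.set_payload_max_length(1024 * 1024 * 1); // overall request body cap, 1MB here (default is SIZE_MAX)
    // ... register handlers and call svr.listen(...) as the server example does ...
    return 0;
}
```

That said, per the comment above, the cleaner fix is on the client side: send the proper Content-Type (application/json) so the form-url-encoded limit never applies.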
It'd probably be a good idea to investigate the changes made in this PR. See also the author's comment on the transformers PR.
This comment from the transformers PR helped me understand what this new NTK-By-Parts/NTKv2 method is about.
AFAIK, LLongMAv2 is a finetune of LLaMA using NTKv2 instead of NTKv1 linear RoPE scaling (NTKv1 cannot be fine-tuned as it does not converge AFAIK).
I converted and quantized LLongMA-3b to ggml: huggingface.co/SlyEcho/LLongMA-3b-ggml.
It does seem to be working with long contexts using -c 8192 --rope-freq-scale 0.25.
It'd probably be a good idea to investigate the changes made in this PR. See also the author's comment on the transformers PR.
This comment from the transformers PR helped me understand what this new LLongMAv2/NTK-By-Parts/NTKv2 method is about (they're all the same thing).
Somebody please explain the python code in equations.
What is the latest state of this approach - is it worth merging and supporting?
I've been using this on a Mac M1 Max since the PR was raised and it's working fine for me. I've been hoping it will get merged so I can go back to compiling from master again. Really enjoying having 8k context.
Let's merge and maybe then improve later.
  if (src0->grad) {
      assert(src1->type == GGML_TYPE_I32);
-     assert(ggml_nelements(src1) == 4);
+     assert(ggml_nelements(src1) == 3);
Shouldn't this be 6? Based on the code immediately after it should be at least 4, I think, not 3.
I went through the code and I also can't see why it's 3 when the lines just below it show it clearly taking 4 elements; it looks like it's designed to fail the assertion.
Should be fixed in 513f861
Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 rope frequency and scale?
https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L2955
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L12418
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L12517
Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 rope frequency and scale?
For the ggml.c lines, those appear to be the RoPE backward passes, confusingly named forward_rope_back.
I left the backward code untouched because I wasn't sure how I could correctly modify it and test it. I'm also not sure about the CUDA bits.
The CUDA part is broken right now, it should be fixed.
How do I implement this with RoPE and without it with current LLMs?
You can read a bit more about RoPE use in llama.cpp in llama.cpp/examples/main/README.md.
Though I would recommend you try out the new Self-Extend support added in #4815, which I think is better, as you don't need to retrain the model to get better results.
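For example, if I remember main's flags correctly, something like the following enables Self-Extend (the model name is made up and the group-attention values are only illustrative; the right values depend on the model's training context):
./main -m your-model.gguf -c 8192 --grp-attn-n 4 --grp-attn-w 1024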
Thanks @abc-nix!
What about the implementation of customized RoPE?
Sorry, @bilal-aamer, I am not sure what you are trying to ask here.
This PR adds customized RoPE support. Later, YaRN RoPE scaling was added in PR #2268, and some other fixes were added after that.
main's help has this to say about the options and parameters for RoPE/YaRN:
--rope-scaling {none,linear,yarn}
RoPE frequency scaling method, defaults to linear unless specified by the model
--rope-scale N RoPE context scaling factor, expands context by a factor of N
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training context size)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: 1.0)
--yarn-beta-fast N YaRN: low correction dim or beta (default: 32.0)
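For illustration only (the model path is made up), a model trained at 4096 could be run at 8192 with linear scaling like this:
./main -m models/my-model.gguf -c 8192 --rope-scaling linear --rope-scale 2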
I am not sure what you are trying to achieve or what exactly you are asking. Hopefully someone else isn't as obtuse as me and can help you out.
Is there any documentation on how to implement this, or an example? I am kind of new to the field; I am fine-tuning Code Llama 2 and want to increase the context length, but between all these posts I am somewhat confused about how to actually implement it.
This is my implementation:
accelerate launch --config_file "./fsdp_config.yaml" fsdp_acc2.py --rope_scaling 0.25
and this is the error I am getting:
RuntimeError: The size of tensor a (16384) must match the size of tensor b (16385) at non-singleton dimension
The original RoPE has pre-defined parameters:
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with defaults that match the original:
scale = 1.0
base = 10000
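As a minimal illustration of the formula above (this is my own sketch, not the actual ggml implementation; the head dimension and position are arbitrary example values):

```cpp
// Sketch only: the customized RoPE frequencies described above,
//   theta_i = scale * base^(-2(i-1)/d),  i = 1 .. d/2
// The rotation angle applied to the i-th pair of dimensions at token
// position p is then p * theta_i.
#include <cmath>
#include <cstdio>

int main() {
    const int   d     = 128;      // head dimension (example value)
    const float base  = 10000.0f; // --rope-freq-base (default)
    const float scale = 1.0f;     // --rope-freq-scale (default)
    const int   p     = 42;       // example token position

    for (int i = 1; i <= 4; ++i) { // print the first few of the d/2 frequencies
        const float theta = scale * std::pow(base, -2.0f * (i - 1) / d);
        std::printf("i=%d  theta=%.8f  angle at p=%d: %.6f\n", i, theta, p, p * theta);
    }
    return 0;
}
```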
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameters.
Recent research shows that changing these two parameters extends the context limit with minimal loss:
Extending Context to 8K kaiokendev https://kaiokendev.github.io/til#extending-context-to-8k
Extending Context Window of Large Language Models via Positional Interpolation Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian https://arxiv.org/abs/2306.15595
NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/user/bloc97 https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5