Is there any guide for setting the extrapolation and NTK parameters? How do they work with the previous two parameters?
The upstream NTKv2 doesn't use --rope-freq-base, so it probably doesn't make sense to use it. It does use --rope-freq-scale, which works like linear scaling, and is supposed to be calibrated so that e.g. .25 scale actually gives you 8192 context. To use the default NTKv2, you should set --rope-ntk-factor and --rope-extrapolation-factor to 1, and set --rope-freq-scale appropriately. The lower the factors are, the less the respective scaling methods are mixed in, although I believe the graphs have been generated with both at 100% - the code automatically ramps them based on some experimentally determined thresholds.
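Conceptually (this is a rough sketch, not the actual NTKv2 code, and the ramp stands in for those experimentally determined thresholds), each factor just controls how much of the corresponding per-dimension adjustment is blended on top of plain linear scaling:

```cpp
// Hedged sketch: 'ramp' is a per-dimension weight in [0, 1] derived from the
// thresholds mentioned above; 'factor' is the user-supplied --rope-ntk-factor
// or --rope-ext-factor. With factor == 0 this collapses to plain linear
// (freq_scale) interpolation.
static float blend_theta(float theta_linear, float theta_adjusted, float ramp, float factor) {
    const float w = ramp * factor;
    return theta_linear * (1.0f - w) + theta_adjusted * w;
}
```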
I would appreciate help with the following:
- Rename `extrapolation_factor` to `ext_factor` everywhere
- No need for a backwards-compatible implementation for now
Base command (with `LLAMA_CUBLAS=1`): `./perplexity -m llama-7b.ggmlv3.q4_0.bin -f wiki.test.raw -ngl 100 -mmq -c 8192`
Perplexity results on WikiText-2:
Arguments | Perplexity |
---|---|
Linear `--rope-freq-scale .25` | 10.39 |
NTKv1 `--rope-freq-scale 0.75 --rope-freq-base 57200` | 7.03 |
NTKv2 `--rope-freq-scale .25 --rope-ntk-factor 1 --rope-ext-factor 1` | 9.24 |
One problem that remains is that the context size the model was originally trained on is hardcoded to 2048. I could either add a parameter for it, or wait for GGUF.
Perplexity with NTKv2 may be worse because neither of these is the dynamic version, which AFAIK works better on non-finetuned models. But fine-tuned models are far superior anyway.
NTKv1 does not converge when fine-tuning, which is why NTKv2 exists. So until somebody publishes a model fine-tuned with NTKv2 (maybe LLongMAv2 will be released after jquesnelle publishes the paper based on scaled-rope), the existing LLongMA, which uses regular linear interpolation (just like SuperHOT), is the state of the art for long contexts.
The paper has been released. The resulting method is called YaRN. Apparently the models that use this technique are good to about 120k tokens of context.
More work will definitely be needed to use these models with llama.cpp.
There are NaNs getting in somewhere:
```
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 72.03 MB
main: ggml.c:12228: ggml_compute_forward_soft_max_f32: Assertion `!isnan(sp[i])' failed.
```
Thank you for the llamacpp implementation of YaRN!
I'm just letting you know that `constant float max_pos_emb = 2048;` should be changed to 4096 for llama 2 models when using YaRN (default was 2048 because we did the most tests with llama 1 models).
This value should probably be saved inside of the model configs and be loaded on inference...
> should be changed to 4096 for llama 2 models
Thanks for reminding me. I originally made this PR before GGUF was finished, so I hardcoded it in the meantime. I believe I can now use the value of `llama.context_length` for this purpose.
Would it be worth testing this with non-YaRN fine-tuned models? If so, any suggested settings? I can test it with ROCm.
> Thank you for the llamacpp implementation of YaRN! I'm just letting you know that `constant float max_pos_emb = 2048;` should be changed to 4096 for llama 2 models when using YaRN (default was 2048 because we did the most tests with llama 1 models). This value should probably be saved inside of the model configs and be loaded on inference...
this needs to be a new GGUF kv, something like "rope_yarn_orig_ctx"
> Thanks for reminding me. I originally made this PR before GGUF was finished, so I hardcoded it in the meantime. I believe I can now use the value of `llama.context_length` for this purpose.
`llama.context_length` should be the size of the finetune, e.g. 128Ki.
> this needs to be a new GGUF kv, something like "rope_yarn_orig_ctx"
Exactly, after finetuning a model with YaRN, we have to keep track of two values: the original context length (2048 for LLaMA or 4096 for Llama 2), and the final context length (which can be calculated by multiplying the original ctx length by the scale factor, e.g. 4096 x 32 = 128Ki).
In this case, the constant `constant float max_pos_emb = 2048;` used in the equations must be equal to the original context size, not the final context size.
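A minimal numeric sketch of that relationship (the variable names here are just illustrative):

```cpp
// Hedged sketch: the YaRN equations take the *original* training context;
// the finetuned ("final") context is simply orig_ctx * scale.
const int   n_orig_ctx  = 4096;                       // Llama 2 base context
const float scale       = 32.0f;                      // YaRN finetune scale factor s
const int   n_final_ctx = (int)(n_orig_ctx * scale);  // 4096 * 32 = 131072 (128Ki)
```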
downloading the 7b 128k model rn
will test later
```cpp
printf("  --yarn-ext-factor N   YaRN extrapolation mix factor (default: %.1f)\n", params.yarn_ext_factor);
printf("  --yarn-attn-factor N  YaRN magnitude scaling factor (default: %.1f)\n", params.yarn_attn_factor);
printf("  --yarn-beta-fast N    YaRN low correction dim (default: %.1f)\n", params.yarn_beta_fast);
printf("  --yarn-beta-slow N    YaRN high correction dim (default: %.1f)\n", params.yarn_beta_slow);
```
Depending on how close the names of the parameters are to the paper, we might want to add a readme with more details.
I don't believe the extrapolation mix factor is described in the paper because it describes an implementation that assumes it is always $1$. attn_factor is not described in the paper, but it multiplies the value described as $\sqrt t$. The slow and fast beta are known as $\alpha$ and $\beta$ in the paper, respectively.
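For reference, if I'm reading the paper and this implementation correctly, the value that `attn_factor` multiplies works out to

$$\text{mscale} = \texttt{attn\_factor} \cdot \left(0.1 \ln s + 1\right), \qquad s = \frac{1}{\texttt{freq\_scale}},$$

so `attn_factor = 1` reproduces the paper's recommended correction, while the two betas bound the per-dimension ramp between pure interpolation and pure extrapolation.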
```python
KEY_ROPE_DIMENSION_COUNT = "{arch}.rope.dimension_count"
KEY_ROPE_FREQ_BASE       = "{arch}.rope.freq_base"
KEY_ROPE_SCALING_TYPE    = "{arch}.rope.scaling.type"
KEY_ROPE_SCALING_FACTOR  = "{arch}.rope.scaling.factor"
```
The removal of scale_linear is a breaking change. I suppose I should at least implement backwards compatibility (edit: done). Should there be a deprecation notice?
did you request a change in the spec in the ggml repo?
How do we move forward with this? philpax never replied. The GGUF spec isn't technically official yet since it was never merged, right?
We should keep this PR open for now. I don't see a huge benefit of merging this method atm, since I believe it is not adopted by popular foundation models, only used for fine-tuning (correct me if I'm wrong).
> I don't see a huge benefit of merging this method atm
YaRN also claims to be the state-of-the-art for context scaling without finetuning. I can put together some updated perplexity numbers.
This is how it currently performs on WikiText-2 with 16K context and a 7B LLaMA-v1 model:
Method | Arguments | Perplexity |
---|---|---|
Linear | --rope-freq-scale .125 | 116.9527 +/- 0.83609 |
Linear + NTK | --rope-freq-scale .375 --rope-freq-base 57200 | 35.0422 +/- 0.23598 |
YaRN | --rope-freq-scale .125 --rope-scaling yarn | 86.4788 +/- 0.67614 |
@bloc97 Does this seem right to you?
> This is how it currently performs on WikiText-2 with 16K context and a 7B LLaMA-v1 model:
>
> Method | Arguments | Perplexity |
> ---|---|---|
> Linear | --rope-freq-scale .125 | 116.9527 +/- 0.83609 |
> NTK | --rope-freq-scale .375 --rope-freq-base 57200 | 35.0422 +/- 0.23598 |
> YaRN | --rope-freq-scale .125 --rope-scaling yarn | 86.4788 +/- 0.67614 |
>
> @bloc97 Does this seem right to you?
I am not so sure. All my tests with NTK-aware scaling were based on the scale factor (alpha); I have never tested changing freq-scale and freq-base at the same time. In an equal-hyperparameters scenario, YaRN should outperform everything else by a significant margin.
YaRN without finetuning at 2x scaling has almost zero PPL degradation. This means that, for example, if you give Llama 2 a prompt of 4k and obtain a PPL of 4.21, then you use YaRN scaling on the model by 2x (effective context of 8k) and give it the same 4k prompt, the PPL should be like 4.23 or something. If you give it an 8k prompt, PPL would decrease to, let's say, 3.91.
In my tests, all other interpolation methods will have significant PPL degradations at all scaling factors.
WikiText-2 with 6144 context and a 7B LLaMA-v1 model, trying pure NTK without linear scaling:
Method | Arguments | Perplexity |
---|---|---|
Linear | --rope-freq-scale .333 | 7.2696 +/- 0.03999 |
NTK | --rope-freq-base 60000 | 6.1330 +/- 0.03285 |
YaRN | --rope-freq-scale .333 --rope-scaling yarn | 6.1653 +/- 0.03305 |
Still not better than NTK, but acceptable I guess. My implementation is probably correct.
> WikiText-2 with 16K context and a 7B LLaMA-v1 model, trying pure NTK without linear scaling:
>
> Method | Arguments | Perplexity |
> ---|---|---|
> Linear | --rope-freq-scale .333 | 7.2696 +/- 0.03999 |
> NTK | --rope-freq-base 60000 | 6.1330 +/- 0.03285 |
> YaRN | --rope-freq-scale .333 --rope-scaling yarn | 6.1653 +/- 0.03305 |
>
> Still not better than NTK, but acceptable I guess. My implementation is probably correct.
It's hard for me to compare against these numbers, as this is 3x context extension right? That means 2k * 3 = 6k extension. But the tests are on 16k context without sliding windows right? Meanwhile all of our tests were done using sliding windows (from Press et al. 2022) such that the sum was performed on the minimal PPL point for all models and all hyperparameters. The two methods are measuring different things (even if the PPL metric is the same).
> But the tests are on 16k context without sliding windows
Sorry, I made a copy-paste error. This is regular 6144-context, non-sliding-window perplexity. Sliding window perplexity is implemented here but AFAIK is currently very slow.
No worries, my point was that as long as the implemented version of YaRN at 2x scaling doesn't negatively impact PPL (on a non-finetuned model), and successfully extends context to 2x the original length, it's implemented correctly.
For example, this is Mistral 7B with YaRN at 2x scaling without any finetuning (sorry, it's the only model I had at hand when writing this), but you'll just have to trust me that Llama 2 has the exact same behaviour (but with a 4k base context instead of 8k)... Note that sliding window attention was disabled for this test.
As you can see, both lines are literally overlapping (maybe like +0.01 PPL for YaRN when context is under 7k, but context is extended for free).
Here is the 16K (8x context extension) test again, with another row added to the table:
Method | Arguments | Perplexity |
---|---|---|
Linear | --rope-freq-scale .125 | 116.9527 +/- 0.83609 |
Linear + NTK | --rope-freq-scale .375 --rope-freq-base 57200 | 35.0422 +/- 0.23598 |
YaRN | --rope-freq-scale .125 --rope-scaling yarn | 86.4788 +/- 0.67614 |
YaRN + NTK | --rope-freq-scale .375 --rope-freq-base 57200 --rope-scaling yarn | 25.7769 +/- 0.17455 |
@ggerganov Without finetuning, YaRN behaves like linear scaling, but with better perplexity scores, especially with longer contexts. I think this is more than just a demo, I would really like to have this in master.
@cebtenzzre The PPL values (> 25.0) are too huge compared to the ~5-6 PPL at 2k context - I don't think this has any practical value if the numbers are correct. At 3x context scaling (6144 ctx size) there seems to be no benefit compared to existing NTK implementation, or the proposed implementation is not correct. So far I don't see a compelling reason to merge this change.
I don't see a significant perplexity benefit using YaRN this way with 4-5x context extension, and perplexity does start to get high outside of that range. So I guess this doesn't make sense to merge until there are more finetuned models that use it.
model | context | freq base | freq scale | linear ppl | YaRN ppl | improvement |
---|---|---|---|---|---|---|
LLaMA-2 7B | 16384 | 57200 | 0.75 | 6.1875 +/- 0.03310 | 6.1324 +/- 0.03275 | 1.009 |
LLaMA-1 7B | 8192 | 57200 | 0.75 | 7.1863 +/- 0.03970 | 7.0246 +/- 0.03864 | 1.023 |
LLaMA-1 7B | 8192 | 57200 | 0.60 | 8.7159 +/- 0.04907 | 8.1126 +/- 0.04524 | 1.074 |
There is a slight bug in this implementation that caused part of the YaRN scaling not to be applied (see cebtenzzre#1). When the fix is applied, the PPL improvements get a lot better:
model | context | freq base | freq scale | linear ppl | YaRN ppl | improvement |
---|---|---|---|---|---|---|
Yarn-Llama-2-7B-64K | 16384 | 57200 | 0.75 | - | 5.4893 +/- 0.02914 | |
LLaMA-2 7B | 16384 | 57200 | 0.75 | 6.1875 +/- 0.03310 | 5.8984 +/- 0.03145 | 1.049 |
LLaMA-1 7B | 8192 | 57200 | 0.75 | 7.1863 +/- 0.03970 | 6.5401 +/- 0.03591 | 1.098 |
LLaMA-1 7B | 8192 | 57200 | 0.60 | 8.7159 +/- 0.04907 | 7.0159 +/- 0.03899 | 1.242 |
I would note that the PR as it stands applies YaRN all the time. I think it will need to be adjusted so that the GGML code selects the appropriate scaling type based on the GGUF `rope.scaling.type` key. (edit: very simple fix I believe, just need to move the calculation of `mscale` into the `ext_factor` conditional above)
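A rough sketch of what that fix could look like (names taken from the discussion above, not the actual patch):

```cpp
#include <cmath>

// Hedged sketch: compute the YaRN magnitude correction only when the
// extrapolation mix is enabled, so that ext_factor == 0 leaves plain
// linear/NTK scaling untouched.
static float yarn_mscale(float ext_factor, float attn_factor, float freq_scale) {
    if (ext_factor == 0.0f) {
        return 1.0f; // YaRN disabled: no magnitude correction
    }
    // 0.1*ln(1/freq_scale) + 1 is the paper's recommended correction for LLaMA-family models
    return attn_factor * (0.1f * std::log(1.0f / freq_scale) + 1.0f);
}
```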
Just wondering how we should handle changing `freq_base` and `freq_scale` in the context of YaRN... YaRN by itself already finds the optimal `freq_base` and `freq_scale` for each dimension individually; in some sense, it can be seen as an automatic and adaptive version of Linear + NTK interpolation applied to each and every RoPE dimension slightly differently, plus an `mscale` attention correction factor that further improves PPL.
IMHO when YaRN is enabled, both `freq_base` and `freq_scale` should be disabled, as any changes will result in an inferior PPL compared to the default values.
To put it simply, all YaRN needs is the original model context length and the target context extension (the ratio $s$ can be computed by dividing the two). The alpha, beta and mscale hyperparameters should be determined in advance for every model on a case-by-case basis and be hidden from the end user. The default alpha, beta and mscale in YaRN are currently only optimal for the LLaMA and Llama 2 family of models.
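In other words, the only scale-related input is the ratio of the two lengths, e.g. for a Llama 2 model extended to 8k:

$$s = \frac{L_{\text{target}}}{L_{\text{orig}}} = \frac{8192}{4096} = 2$$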
Also, there might be some other subtle implementation differences between huggingface transformers and llama.cpp, as the PPL improvements seen above are fairly minimal. YaRN should have significantly better PPL compared to other methods in non-finetuned scenarios.
The advantages of YaRN are threefold:
Here are some benchmarks I got from huggingface transformers on the GovReport dataset. There might still be a small bug in this implementation somewhere, as I'm getting better PPL improvements across the board with the reference implementation.
Fixed s=4 YaRN:
model | context | freq base | freq scale | scale-base ppl | YaRN ppl (s=4) | improvement |
---|---|---|---|---|---|---|
LLaMA-1 7B | 8192 | 10000 | 0.25 | 6.4290 | 4.3511 | 1.477 |
LLaMA-1 7B | 8192 | 74000 | 1.00 | 4.8251 | 4.3511 | 1.109 |
LLaMA-1 7B | 8192 | 57200 | 0.75 | 4.9294 | 4.3511 | 1.133 |
LLaMA-1 7B | 8192 | 57200 | 0.60 | 5.5975 | 4.3511 | 1.286 |
Dynamic-YaRN:
model | context | freq base | freq scale | scale-base ppl | Dynamic YaRN ppl | improvement |
---|---|---|---|---|---|---|
LLaMA-1 7B | 8192 | 10000 | 0.25 | 6.4290 | 4.1972 | 1.532 |
LLaMA-1 7B | 8192 | 74000 | 1.00 | 4.8251 | 4.1972 | 1.150 |
LLaMA-1 7B | 8192 | 57200 | 0.75 | 4.9294 | 4.1972 | 1.174 |
LLaMA-1 7B | 8192 | 57200 | 0.60 | 5.5975 | 4.1972 | 1.334 |
At s=8, there's no contest... (Again, note these are reference benchmarks using huggingface transformers.)
Fixed s=8 YaRN:
model | context | freq base | freq scale | scale-base ppl | YaRN ppl (s=8) | improvement |
---|---|---|---|---|---|---|
LLaMA-1 7B | 16384 | 10000 | 0.125 | 45.066 | 4.5578 | 9.888 |
LLaMA-1 7B | 16384 | 200000 | 1.00 | 7.7015 | 4.5578 | 1.690 |
LLaMA-1 7B | 16384 | 120000 | 0.75 | 8.5140 | 4.5578 | 1.868 |
LLaMA-1 7B | 16384 | 120000 | 0.60 | 10.523 | 4.5578 | 2.309 |
Dynamic-YaRN:
model | context | freq base | freq scale | scale-base ppl | Dynamic YaRN ppl | improvement |
---|---|---|---|---|---|---|
LLaMA-1 7B | 16384 | 10000 | 0.125 | 45.066 | 4.3574 | 10.34 |
LLaMA-1 7B | 16384 | 200000 | 1.00 | 7.7015 | 4.3574 | 1.767 |
LLaMA-1 7B | 16384 | 120000 | 0.75 | 8.5140 | 4.3574 | 1.954 |
LLaMA-1 7B | 16384 | 120000 | 0.60 | 10.523 | 4.3574 | 2.415 |
Okay, I think I have some definitive answers now! There was a second bug in the implementation, but it looks like we have it all squared away now (cebtenzzre#2 awaiting merge).
There are two scenarios to consider: non-finetuned (extending any model) and finetuned (using a model trained with YaRN). With the updated code we're getting quite good results under both scenarios. All evals are done with Q4_0.
For reproducibility, here is an example command line (from "Finetuned" below): `./perplexity -m yarn-llama-2-7b-64k.Q4_0.gguf -f wiki.test.raw -ngl 100 -c 16384 --rope-scaling yarn --rope-freq-scale 0.0625 --yarn-orig-ctx 4096`
Model: LLaMA 2 7B
Commit: jquesnelle@f51eed1
ctx | base | scale | linear ppl | YaRN ppl | improvement |
---|---|---|---|---|---|
16384 | 57200 | 0.6000 | 7.2699 +/- 0.03976 | 6.0714 +/- 0.03235 | 1.197 |
16384 | 57200 | 0.7500 | 6.1870 +/- 0.03310 | 5.8867 +/- 0.03129 | 1.051 |
16384 | 57200 | 1.0000 | 9.6980 +/- 0.06093 | 9.6980 +/- 0.06093 | 1.000 |
16384 | 10000 | 0.1250 | 54.4799 +/- 0.36304 | 6.5957 +/- 0.03584 | 8.300 |
Commit: f3b25e4
ctx | base | scale | linear ppl | YaRN ppl |
---|---|---|---|---|
16384 | 57200 | 1.0000 | 9.6980 +/- 0.06093 | n/a |
Model: YaRN LLaMA 2 7B 64K
Commit: jquesnelle@f51eed1
ctx | base | scale | YaRN ppl |
---|---|---|---|
16384 | 10000 | 0.0625 | 5.1497 +/- 0.02717 |
In both scenarios, YaRN performs better than regular linear scaling. We additionally see that the YaRN code is equivalent to linear scaling when the scale is 1, with or without a base change. Moreover, the perplexities match the existing code on master, meaning the changes are backward-compatible. Given this, I think it's a good candidate to merge.
> IMHO when YaRN is enabled, both `freq_base` and `freq_scale` should be disabled, as any changes will result in an inferior PPL compared to the default values.
@bloc97 I have --rope-freq-scale set up to configure the YaRN scale factor when "--rope-scaling yarn" is passed, which seemed simpler than making it separately configurable but mutually exclusive. And based on jquesnelle's results above, the perplexity with YaRN is 12% better with freq_base=57200 (5.8867) than with freq_base=10000 (6.5957), even after fixing the bugs in the implementation. So I'm not inclined to disable the --rope-freq-base option.
> @bloc97 I have --rope-freq-scale set up to configure the YaRN scale factor when "--rope-scaling yarn" is passed, which seemed simpler than making it separately configurable but mutually exclusive. And based on jquesnelle's results above, the perplexity with YaRN is 12% better with freq_base=57200 (5.8867) than with freq_base=10000 (6.5957), even after fixing the bugs in the implementation. So I'm not inclined to disable the --rope-freq-base option.
I'll do a few more tests using the reference implementation in huggingface (to figure out whether that's a bug or actual behaviour), but I think since the finetuned YaRN models now work, we can go ahead with the merge and look at the remaining stuff as we go...
@ggerganov What are your thoughts on the current state of this PR?
The numbers look better, I think we can merge. Let me take one more look again tomorrow and will proceed.
Perplexity aside, do we have studies that show when using YARN the model is still able to recover information from the entire context? I'm thinking, for example with context shift (a.k.a. StreamingLLM, a.k.a. old context swap) we also get good perplexity for very long contexts, but the problem is that the model "forgets" stuff that goes out of scope. Does YARN solve this?
> do we have studies that show when using YARN the model is still able to recover information from the entire context
There is a passkey test here but I don't know if the results of it were published anywhere.
@ggerganov @cebtenzzre The 128k YaRN FTed models have 99.4% random passkey retrieval accuracy across their entire context size. This can be tested using the file that @cebtenzzre linked. Non-FTed YaRN also has relatively high passkey accuracies but I don't have the results on hand.
The implementation is very well done, so let's merge it. I'm worried we are adding quite a lot of extra code for this feature, but hopefully it will be useful.
Btw, I continue to be skeptical about the usefulness of YARN. When extending the context size, I expect to see PPL drop, not remain at the same level as with the original context.
The passkey test passing is OK, but I think that sliding-window processing using the original context size would also pass it. Would like to be proven wrong. But in any case, adding a passkey test example to `llama.cpp` would be useful in general.
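For what it's worth, a passkey prompt is easy to sketch (the filler text and value here are hypothetical; the linked test script may differ):

```cpp
#include <cstdio>
#include <string>

// Hedged sketch: bury a number in a long stretch of filler text,
// then ask the model to repeat it at the very end.
int main() {
    const int n_filler = 400;   // repeat until the target context is roughly filled
    const int passkey  = 72468; // hypothetical value
    const std::string filler = "The grass is green. The sky is blue. The sun is yellow. ";

    std::string prompt;
    for (int i = 0; i < n_filler; i++) {
        if (i == n_filler / 2) { // or at a random offset
            prompt += "The pass key is " + std::to_string(passkey) + ". Remember it. ";
        }
        prompt += filler;
    }
    prompt += "\nWhat is the pass key? The pass key is";
    std::printf("%s\n", prompt.c_str());
    return 0;
}
```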
@cebtenzzre Are you planning to finish and test the Falcon implementation before merging?
> Btw, I continue to be skeptical about the usefulness of YARN. When extending the context size, I expect to see PPL drop, not remain at the same level as with the original context.
PPL curves are always decreasing for finetuned YaRN models. Llama-2 YaRN models are currently the only models that have the property of the PPL always decreasing up to 128k context. Codellama has similar properties but much worse PPL, due to it using older NTK-aware scaling and other confounding factors (it being a code-focused model).
Also, the plots shown above are for non-FTed LLaMA 1. In this scenario, YaRN is also a huge step up from previous methods.
> The passkey test passing is OK, but I think that sliding-window processing using the original context size would also pass it. Would like to be proven wrong. But in any case, adding a passkey test example to `llama.cpp` would be useful in general.
Sliding window (either on the prompt or on the attention logits) won't work for passkey at all unless you know exactly where the passkey was in advance in the prompt (for example using an oracle), which defeats the purpose of using a long context model in the first place. In any case, you can put three passkeys (one in the beginning, one in the middle and one in the end), and ask the model to retrieve all three at the same time.
> Are you planning to finish and test the Falcon implementation before merging?
Right now it's deactivated for Falcon. I'm looking into it, but I don't really understand the Metal implementation of GPT-NeoX RoPE (#3024 (comment)) - so I'm not 100% sure what to put in place of `i0` (number of rotations) for each backend.
@bloc97 By sliding window I mean the strategy where the KV cache is shifted to evict old tokens and free space for new tokens. This is implemented in the `main` example of `llama.cpp` for infinite text generation:
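Roughly, the eviction works like this (a sketch with assumed helper names and signatures, not the actual main.cpp code):

```cpp
// Hedged sketch: once the context is full, keep the first n_keep tokens,
// drop the oldest half of the remainder, and shift the surviving KV
// entries back so generation can continue within n_ctx.
static void context_shift(llama_context * ctx, int & n_past, int n_ctx, int n_keep, int n_new) {
    if (n_past + n_new <= n_ctx) {
        return; // still room, nothing to do
    }
    const int n_left    = n_past - n_keep;
    const int n_discard = n_left / 2;

    // assumed KV-cache calls; the exact API names and signatures may differ
    llama_kv_cache_seq_rm   (ctx, 0, n_keep,             n_keep + n_discard);
    llama_kv_cache_seq_shift(ctx, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}
```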
This strategy retains past information beyond the limit of the context size, because new tokens attend to the KV of old tokens which have "seen" the evicted tokens. I just don't know to what extent this information is retained, but intuitively, the passkey test has such low entropy (because of the repeated text) that I wouldn't be surprised if the sliding window on the KV cache passed it. It's something that can be easily tested.
> This strategy retains past information beyond the limit of the context size, because new tokens attend to the KV of old tokens which have "seen" the evicted tokens. I just don't know to what extent this information is retained, but intuitively, the passkey test has such low entropy (because of the repeated text) that I wouldn't be surprised if the sliding window on the KV cache passed it. It's something that can be easily tested.
We have tested a similar strategy using the SWA implementation from Mistral 7b, and the passkey result drops to 0% after evicting the tokens in the kv-cache, while PPL stays stable. It seems that the attention algorithm does not compress past information into the new tokens (because it was never trained to do so)...
Yup, I did the test as well just now and indeed it fails: #3856
So my expectation was not correct.
With the recent refactoring, I've created quite some conflicts (sorry about that).
After we resolve them, we can merge straight away.
`llama_context_default_params()` needs to be updated: the comments don't match the initializers and there are missing initializers.
Based
@ggerganov Could you go to Settings > Moderation options > Interaction limits and block the above user (Dezzj) from commenting? The spam is getting annoying.
https://docs.github.com/en/communities/moderating-comments-and-conversations/limiting-interactions-in-your-repository
@cebtenzzre seems like this commit broke CI? The last 3 builds have failures on the v100.
I can confirm that this is working on my mac :)
> seems like this commit broke CI? The last 3 builds have failures on the v100.
Erm... yeah, I can reproduce this locally. I think I screwed up something in 15f26ef. I thought I tested that change, but I must not have had CUDA enabled. I thought the PR's CI would have caught this...
this commit appears to have caused one of my phind codellama 34b 16k q5 models to emit gibberish on a CUDA machine, but not on my mac. I can provide a detailed reproduction if you want, or can wait until after your fix.
@IridiumMaster Can you confirm latest master is stable?
@cebtenzzre There is no per-user option, or at least I cannot find it
I am integrating this commit and I don't know how to disable YaRN across all other modes.
Specifically, there are now 5 new arguments required for `ggml_rope_custom_inplace`. Zeroing all these new values out results in incoherent output. Only by using the values `0 NAN 1 32 1` am I able to get coherent output again - but I have no idea if the behavior matches how RoPE behaved previously. Does that mean YaRN RoPE scaling is disabled? What values should I use to get the exact same behavior in other models as before this commit was merged? Or must the "disable" state match these values?
> @IridiumMaster Can you confirm latest master is stable?

> @cebtenzzre There is no per-user option, or at least I cannot find it
Hi, the latest master is stable for me. I do not see the gibberish that I did before. In case you're wanting to reproduce in future, here are some steps:
```
### System Prompt
### User Message
What is the capital of Nebraska?
### Assistant
```
> I am integrating this commit and I don't know how to disable YaRN across all other modes.
Currently, attn_factor must be 1.0f, ext_factor must be 0.0f, and the rest don't matter but can be zero. Then YaRN should be fully disabled.
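Spelled out (a sketch; the beta defaults are the ones shown in the diff later in this thread):

```cpp
// Hedged sketch: with ext_factor == 0.0f the YaRN mixing is off, so RoPE
// behaves as it did before this PR; the beta values are then ignored.
float yarn_ext_factor  = 0.0f;  // extrapolation mix factor (0 = YaRN disabled)
float yarn_attn_factor = 1.0f;  // magnitude scaling factor (1 = unchanged)
float yarn_beta_fast   = 32.0f; // low correction dim (ignored when ext_factor == 0)
float yarn_beta_slow   = 1.0f;  // high correction dim (ignored when ext_factor == 0)
```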
After this merge, every single model I tried failed to produce meaningful words. The previous commit c43c2da works fine.
OK. I found the culprit. I have to explicitly pass `--yarn-ext-factor 0.0` to `main`. Otherwise it gives gibberish.
This is alright:
```
[1698963713] Log start
[1698963713] Cmd: ./main -m models/codellama-7b-instruct.Q8_0.gguf -n 1 --top-k 1 -p " 1+1=" --yarn-ext-factor 0.0
[1698963713] main: build = 1477 (629f917)
[1698963713] main: built with clang version 17.0.4 for arm64-apple-darwin23.1.0
[1698963713] main: seed = 1698963713
[1698963713] main: llama backend init
[1698963713] main: load the model and apply lora adapter, if any
[1698963713] warming up the model with an empty run
[1698963713] n_ctx: 512
[1698963713]
[1698963713] system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
[1698963713] add_bos: 1
[1698963713] tokenize the prompt
[1698963713] prompt: " 1+1="
[1698963713] tokens: [ '':1, ' ':29871, '1':29896, '+':29974, '1':29896, '=':29922 ]
[1698963713] recalculate the cached logits (check): embd_inp.empty() false, n_matching_session_tokens 0, embd_inp.size() 6, session_tokens.size() 0, embd_inp.size() 6
[1698963713] inp_pfx: [ '':1, '':13, '':13, '##':2277, '#':29937, ' Inst':2799, 'ruction':4080, ':':29901, '':13, '':13 ]
[1698963713] inp_sfx: [ '':13, '':13, '##':2277, '#':29937, ' Response':13291, ':':29901, '':13, '':13 ]
[1698963713] sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 1, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[1698963713] generate: n_ctx = 512, n_batch = 512, n_predict = 1, n_keep = 0
[1698963713]
[1698963713] embd_inp.size(): 6, n_consumed: 0
[1698963713] eval: [ '':1, ' ':29871, '1':29896, '+':29974, '1':29896, '=':29922 ]
[1698963713] n_past = 6
[1698963713] sampled token: 29906: '2'
```
This is bad:
```
[1698963773] Log start
[1698963773] Cmd: ./main -m models/codellama-7b-instruct.Q8_0.gguf -n 1 --top-k 1 -p " 1+1="
[1698963773] main: build = 1477 (629f917)
[1698963773] main: built with clang version 17.0.4 for arm64-apple-darwin23.1.0
[1698963773] main: seed = 1698963773
[1698963773] main: llama backend init
[1698963773] main: load the model and apply lora adapter, if any
[1698963773] warming up the model with an empty run
[1698963773] n_ctx: 512
[1698963773]
[1698963773] system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
[1698963773] add_bos: 1
[1698963773] tokenize the prompt
[1698963773] prompt: " 1+1="
[1698963773] tokens: [ '':1, ' ':29871, '1':29896, '+':29974, '1':29896, '=':29922 ]
[1698963773] recalculate the cached logits (check): embd_inp.empty() false, n_matching_session_tokens 0, embd_inp.size() 6, session_tokens.size() 0, embd_inp.size() 6
[1698963773] inp_pfx: [ '':1, '':13, '':13, '##':2277, '#':29937, ' Inst':2799, 'ruction':4080, ':':29901, '':13, '':13 ]
[1698963773] inp_sfx: [ '':13, '':13, '##':2277, '#':29937, ' Response':13291, ':':29901, '':13, '':13 ]
[1698963773] sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 1, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[1698963773] generate: n_ctx = 512, n_batch = 512, n_predict = 1, n_keep = 0
[1698963773]
[1698963773] embd_inp.size(): 6, n_consumed: 0
[1698963773] eval: [ '':1, ' ':29871, '1':29896, '+':29974, '1':29896, '=':29922 ]
[1698963773] n_past = 6
[1698963773] sampled token: 0: ''
```
The issue is that this code depends on `NAN` and `std::isnan`, which unfortunately break when compiled with `LLAMA_FAST`, i.e. `-Ofast`: `-Ofast` enables `-ffast-math`, under which the compiler may assume NaNs never occur and optimize the `isnan` check away. If `ext_factor` would never go negative, something like this would work:
```diff
diff --git a/common/common.h b/common/common.h
index 72a49b8..7760fb5 100644
--- a/common/common.h
+++ b/common/common.h
@@ -61,7 +61,7 @@ struct gpt_params {
     int32_t n_beams = 0; // if non-zero then use beam search of given width.
     float rope_freq_base = 0.0f; // RoPE base frequency
     float rope_freq_scale = 0.0f; // RoPE frequency scaling factor
-    float yarn_ext_factor = NAN; // YaRN extrapolation mix factor
+    float yarn_ext_factor = -1.0f;// YaRN extrapolation mix factor
     float yarn_attn_factor = 1.0f; // YaRN magnitude scaling factor
     float yarn_beta_fast = 32.0f;// YaRN low correction dim
     float yarn_beta_slow = 1.0f; // YaRN high correction dim
diff --git a/llama.cpp b/llama.cpp
index bb60044..5748c52 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -8125,7 +8125,7 @@ struct llama_context * llama_new_context_with_model(
         cparams.rope_freq_scale = 1.0f; // never scale if scaling type is none
     }
-    if (std::isnan(cparams.yarn_ext_factor)) { // NaN indicates 'not set'
+    if (cparams.yarn_ext_factor < 0.0f) { // negative indicates 'not set'
         cparams.yarn_ext_factor = rope_scaling_type == LLAMA_ROPE_SCALING_YARN ? 1.0f : 0.0f;
     }
```
> If `ext_factor` would never go negative, something like this would work:
I'd be fine with that solution. Would you like to make a PR?
edit: For some reason, I can't reproduce this on Linux with clang or gcc, or on an M2 Mac, at least on CPU.
edit 2: I can't build llama.cpp with Metal on my Mac:
```
c++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Ofast -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/main/main.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
0  0x102f3b648  __assert_rtn + 72
1  0x102e63c5c  ld::Fixup::applyFixup(ld::Atom const*, ld::LayoutLinkedImage const&, unsigned char*) const + 8268
2  0x102ef67d8  ___ZN2ld16LayoutExecutable27writeContentWithoutLinkEditENSt3__14spanIhLm18446744073709551615EEEy_block_invoke + 332
3  0x102ef6a14  void mapReduce<ld::Atom const*, mach_o::Error>(std::__1::span<ld::Atom const*, 18446744073709551615ul>, unsigned long, void (unsigned long, mach_o::Error&, std::__1::span<ld::Atom const*, 18446744073709551615ul>) block_pointer, void (std::__1::span<mach_o::Error, 18446744073709551615ul>) block_pointer) + 384
4  0x102ef6594  ld::LayoutExecutable::writeContentWithoutLinkEdit(std::__1::span<unsigned char, 18446744073709551615ul>, unsigned long long) + 1180
5  0x102efc020  ld::LayoutExecutable::writeToFile(char const*) + 15248
6  0x102eae2e8  main + 9424
ld: Assertion failed: (extras.otherInstrOffset != 0 && "Kind::arm64_adrp_ldr missing extra info"), function applyFixup, file Fixup.cpp, line 793.
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [main] Error 1
```
Seems like a bug in the XCode-provided clang 15?
#2268 (comment) - this seems to fix my problem. Really weird that it only has an effect when offloading that last non-repeating layer.
@cebtenzzre thanks for pushing the pr.
Now I'm testing this https://huggingface.co/TheBloke/Yarn-Mistral-7B-64k-GGUF and I'm getting
```
$ ./perplexity -t 1 -ngl 1 -m models/yarn-mistral-7b-64k.Q8_0.gguf -c 512 -f ../wikitext-2-raw/wiki.test.raw 2>/dev/null
[1]24.7243,[2]31.1885,[3]36.5431,[4]41.0809,^C
```
so something must be wrong, as the base model has
```
$ ./perplexity -t 1 -ngl 1 -m models/mistral-7b-v0.1.Q8_0.gguf -c 512 -f ../wikitext-2-raw/wiki.test.raw 2>/dev/null
[1]3.9958,[2]4.4960,[3]5.2987,[4]5.9971,^C
```
The GGUF is recognized correctly:
```
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.125
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = yes
```
and
```
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.125
llama_new_context_with_model: kv self size = 64.00 MB
```
This is an implementation of YaRN RoPE scaling. See https://github.com/jquesnelle/yarn and the paper and errata.
TODO: