text-generation-webui
Option to select/target additional linear modules/layers in LORA training
#4178
Merged


computerman00 commented 1 year ago (edited)


I've been patching this in for a while and figured I'd make a PR, since I've seen some truly awesome gains from using all LlamaAttention modules (and even all linear layers).

I was inspired to start experimenting with additional targeted attention modules in LORA training after reading some discussions on r/LocalLLaMA, the QLORA paper, and the ReLoRA paper:

"When using the standard practice of applying LoRA to query and value attention projection matrices, we are not able to replicate full finetuning performance for large base models...We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers are required to match full finetuning performance"
Source: QLORA Paper

The original LORA paper is also a must-read.
q, k, v, and o stand for query, key, value, and output, respectively.
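In PEFT terms, the checkboxes simply extend the `target_modules` list handed to `LoraConfig`. A minimal sketch of that mapping (the helper is hypothetical, not this PR's actual code; the module names assume a LLaMA-style architecture):

```python
# Hypothetical helper: map UI checkbox selections to PEFT target_modules.
# Module names follow the LLaMA architecture: q_proj/k_proj/v_proj/o_proj
# for attention, gate_proj/up_proj/down_proj for the MLP block.

ATTN_PROJECTIONS = {"q": "q_proj", "k": "k_proj", "v": "v_proj", "o": "o_proj"}
MLP_PROJECTIONS = {"gate": "gate_proj", "up": "up_proj", "down": "down_proj"}

def build_target_modules(selected):
    """Translate short names like {'q', 'v'} into PEFT module names."""
    mapping = {**ATTN_PROJECTIONS, **MLP_PROJECTIONS}
    unknown = set(selected) - mapping.keys()
    if unknown:
        raise ValueError(f"Unknown projection(s): {sorted(unknown)}")
    # Preserve a stable order so training logs stay reproducible.
    return [name for key, name in mapping.items() if key in selected]

# Default LLaMA behaviour: only query and value projections.
print(build_target_modules({"q", "v"}))  # ['q_proj', 'v_proj']
# All linear transformer-block layers, as recommended in the QLORA paper.
print(build_target_modules({"q", "k", "v", "o", "gate", "up", "down"}))
```

The resulting list would then be passed as `LoraConfig(target_modules=...)` when building the adapter.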

I tried to keep changes to an absolute minimum; I did, however, raise the cutoff-length maximum from 2048 to 4096. If this is controversial enough we can change it back, but 4k is easily achievable on 80GB of VRAM (and even 48GB for 13B models). I tend to always patch this in and figured it wouldn't hurt, and could potentially help or save time for people with VRAM to spare.

Here are some screenshots:
Screenshot_20231004_210046

Screenshot_20231004_210140

Screenshot_20231004_210818

Screenshot_20231004_210630

Screenshot_20231004_211210

EDIT: The above 2 screenshots shouldn't be taken as a direct comparison, as I realized they had different ranks (64 vs 32, respectively) after looking at the training logs. Here are runs with the same parameters for everything but target modules, for a general idea of what to expect:

Training 'llama' model using (gate, down, up, q, v, k, o) projections                                                                      
Trainable params: 125,173,760 (0.9525 %), All params: 13,141,201,920 (Model: 13,016,028,160)



Training 'llama' model using (q, v, k, o) projections
Trainable params: 52,428,800 (0.4012 %), All params: 13,068,456,960 (Model: 13,016,028,160)
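The trainable-parameter counts in those two logs can be reproduced from the LoRA adapter arithmetic: each targeted `d_out × d_in` matrix gains two low-rank factors, A (`r × d_in`) and B (`d_out × r`), i.e. `r · (d_in + d_out)` extra parameters. A quick sanity check, assuming LLaMA-13B dimensions (hidden 5120, intermediate 13824, 40 layers) and rank 32:

```python
# Sanity-check the trainable-parameter counts from the logs above.
# LoRA adds A (r x d_in) and B (d_out x r) per targeted matrix,
# i.e. r * (d_in + d_out) parameters. Dimensions assume LLaMA-13B, rank 32.
HIDDEN, INTERMEDIATE, LAYERS, RANK = 5120, 13824, 40, 32

def lora_params(shapes, r=RANK, layers=LAYERS):
    """Total adapter parameters for one set of (d_in, d_out) shapes per layer."""
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# q, k, v, o projections are all square (hidden -> hidden).
attn = lora_params([(HIDDEN, HIDDEN)] * 4)
# gate/up project hidden -> intermediate; down projects intermediate -> hidden.
mlp = lora_params([(HIDDEN, INTERMEDIATE), (HIDDEN, INTERMEDIATE),
                   (INTERMEDIATE, HIDDEN)])

print(f"{attn:,}")        # 52,428,800  -- matches the (q, v, k, o) log
print(f"{attn + mlp:,}")  # 125,173,760 -- matches the all-projections log
```

Both totals match the logged figures exactly, which also confirms the runs were at rank 32.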
initial tst push of added checkboxes to select target mods
f26ff42d
Organizing checkboxes in rows and setting default q,v for llama
0ebc9dbc
Not a fan of this, forces off by default even if user set env variabl…
c67a0db3
Organizing ui
598bce24
organizing ui
b20b1fd8
cleaning up code and reverting WANDB_MODE change I made(Although I do…
384e0204
cleaning up code and reverting WANDB_MODE change I made(Although I do…
2034ecd5
Adding info and note.
74847302
Fixed silly ui mistake
e67ece35
further clarification info
e3c73fbc
epolewski commented 1 year ago

I'm testing this out and it seems to be working. A few cosmetic things:
It breaks the refresh button look for me. The buttons still work:
image
On boot I see these errors. I was already getting one, but with this PR I get two. They don't seem to affect anything:
image
It does appear to be working though!

computerman00 Merge branch 'oobabooga:main' into target_mod_sel_ui
18a652ce
Fixing PEP code-style issue that was missed
a9cff107
computerman00 commented 1 year ago (edited)

> I'm testing this out and it seems to be working. Few cosmetic things: Breaks the refresh button look for me. The buttons still work: image On boot I see these errors. I was already getting one, but with this PR I get 2. They don't seem to affect anything: image It does appear to be working though!

Interesting, thanks for the feedback!

I have not been able to replicate this (neither the undefined buttons nor that second warning), so I will continue to investigate. I did just sync/merge oobabooga:main into the feature branch; I suspect there may have been some conflicts with newer dependencies/code that weren't present in my branch.

I would much appreciate any feedback if you do get to try again. Thanks again!

EDIT: Tested from a fresh Python 3.10 Miniconda environment with the default modules + requirements.txt, and everything worked as expected.

However, I did try testing with the newest requirements.txt + my old main branch and was indeed able to replicate the "undefined" buttons. You should be okay after pulling down the synced target_mod_sel_ui branch (or patching training.py with the diff).

computerman00 Merge branch 'oobabooga:main' into target_mod_sel_ui
7c0c9175
computerman00 Merge branch 'oobabooga:main' into target_mod_sel_ui
63ebad35
FartyPants commented 1 year ago (edited)

You are repeating the exact same thing that is already in the Training PRO extension, which is already supplied with the main distribution (even in the current version) and has been there for some time. Look in Advanced settings. The whole idea is that the training tab would remain lean and mean, while Training PRO would allow for more messing around.

image

Also, I would caution: most people should use Q,V when dealing with fine-tuned models.

Using all attention layers requires a higher-quality and larger training set than most people are willing to put up with, and the model may then deteriorate or become too stiff if used on a previously fine-tuned model.

You are quoting the QLora paper, but you forgot to note the size and quality of the Guanaco dataset, and that it was used on a base model. This isn't what most people do with QLora. It isn't necessarily what LORA was created for either. Q and V were suggested not because the authors were somehow oblivious to the fact that there are 7 projections, but because the people who suggested it found it the best performing.

Guanaco was a proof of the QLora concept: if you use all projections and a high-quality dataset, you can get away with QLora instead of fine-tuning the base model (and it's true to some extent). But I'm not aware of many other people doing QLORA to fine-tune a base model on a dataset of that size versus doing a full fine-tune (not LORA). At least not the models people use - those are all full fine-tunes. It would be counterproductive, because you want the best quality if you are bothering with all that.

Take it from someone who has done 500+ LoRAs (I tried everything, and now most of the time use q,v).

The original LORA paper touches on it while determining that "Adapting both Wq and Wv gives the best performance overall". The data they presented suggest that the higher the rank, the less difference choosing more projections makes, and so more targets can be used to lower the rank (for example, a rank-64 QV would equal a rank-8 QKVO). This is what advanced users can keep in mind when budgeting GPU.

image

These are very important distinctions.

For people who may ever read my unhinged chatter, here is my suggestion (based on experience):
Don't use more targets at the cost of lowering other parameters - you'll be doing it WRONG.
If you add additional projections and your GPU doesn't OOM, that only means your initial parameters did not fully utilize your training setup. You could have increased the batch size or frame size (cutoff length) instead, which will increase quality far more dramatically than adding more projections.
As the data suggest, you may use more targets in order to be able to lower the rank. For example, say you want rank 256, but your GPU OOMs. You may add QKVO and then lower the rank to 64 or 128, which may provide the same overall accuracy, but won't OOM, and you don't need to lower the batch size or make the frames smaller, which would far more significantly affect the model.
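The budgeting trade-off above is easy to quantify: for square attention projections, adapter size scales with rank × number of targets, so doubling the targets while halving the rank keeps the adapter (and its optimizer state) the same size, and dropping further actually saves memory. A rough sketch, assuming LLaMA-13B attention dimensions:

```python
# Rough adapter-size arithmetic for the rank-vs-targets trade-off.
# For square (hidden -> hidden) attention projections, each target adds
# rank * 2 * hidden parameters per layer, so size ~ rank * n_targets.
HIDDEN, LAYERS = 5120, 40  # LLaMA-13B attention dimensions (assumed)

def adapter_params(rank, n_targets):
    return LAYERS * n_targets * rank * 2 * HIDDEN

qv_r256 = adapter_params(rank=256, n_targets=2)    # Q,V at rank 256
qkvo_r128 = adapter_params(rank=128, n_targets=4)  # Q,K,V,O at rank 128
qkvo_r64 = adapter_params(rank=64, n_targets=4)    # Q,K,V,O at rank 64

print(qv_r256 == qkvo_r128)   # True: same adapter size, won't help an OOM
print(qkvo_r64 * 2 == qv_r256)  # True: rank 64 halves the adapter
```

So dropping from rank-256 QV to rank-128 QKVO keeps the adapter parameter count identical, while rank-64 QKVO halves it; whether the accuracy holds up is the empirical claim above, not something this arithmetic shows.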

oobabooga commented 1 year ago

The Training PRO extension and the main training tab are independent. It's fine to update the main one with a missing feature, especially when the additions are tidy, as is the case here.

The PR seems to have been tested and working, so let's merge it. Thank you @computerman00!

oobabooga merged 4405513c into main 1 year ago
