text-generation-webui
Option to select/target additional linear modules/layers in LORA training
#4178
Merged


computerman00 commented 1 year ago (edited)


I've been patching this in for a while and figured I'd make a PR, since I've seen some truly awesome gains from using all LlamaAttention modules (and even all linear layers).

I was inspired to start experimenting with additional targeted attention modules in LORA training after reading some discussions on r/LocalLLaMA, the QLORA paper, and the ReLoRA paper:

"When using the standard practice of applying LoRA to query and value attention projection matrices, we are not able to replicate full finetuning performance for large base models...We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers are required to match full finetuning performance"
Source: QLORA Paper

The original LORA paper is also a must-read.
q, k, v, and o stand for query, key, value, and output, respectively.
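In PEFT terms, the checkboxes simply extend the `target_modules` list handed to `LoraConfig`. A minimal sketch of that mapping (the helper is hypothetical, not this PR's actual code; the module names assume a LLaMA-style architecture):

```python
# Hypothetical helper: map UI checkbox selections to PEFT target_modules.
# Module names follow the LLaMA architecture: q_proj/k_proj/v_proj/o_proj
# for attention, gate_proj/up_proj/down_proj for the MLP block.

ATTN_PROJECTIONS = {"q": "q_proj", "k": "k_proj", "v": "v_proj", "o": "o_proj"}
MLP_PROJECTIONS = {"gate": "gate_proj", "up": "up_proj", "down": "down_proj"}

def build_target_modules(selected):
    """Translate short names like {'q', 'v'} into PEFT module names."""
    mapping = {**ATTN_PROJECTIONS, **MLP_PROJECTIONS}
    unknown = set(selected) - mapping.keys()
    if unknown:
        raise ValueError(f"Unknown projection(s): {sorted(unknown)}")
    # Preserve a stable order so training logs stay reproducible.
    return [name for key, name in mapping.items() if key in selected]

# Default LLaMA behaviour: only query and value projections.
print(build_target_modules({"q", "v"}))  # ['q_proj', 'v_proj']
# All linear transformer-block layers, as recommended in the QLORA paper.
print(build_target_modules({"q", "k", "v", "o", "gate", "up", "down"}))
```

The resulting list would then be passed as `LoraConfig(target_modules=...)` when building the adapter.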

I tried to keep changes to an absolute minimum; I did, however, raise the cutoff-length maximum from 2048 to 4096. If this is controversial enough we can change it back, but 4k is easily achievable on 80GB of VRAM (and even 48GB for 13B models). I tend to always patch this in and figured it wouldn't hurt, and could potentially help or save time for people with VRAM to spare.

Here are some screenshots:
Screenshot_20231004_210046

Screenshot_20231004_210140

Screenshot_20231004_210818

Screenshot_20231004_210630

Screenshot_20231004_211210

EDIT: The above 2 screenshots shouldn't be taken as a direct comparison, as I realized they had different ranks (64 vs 32, respectively) after looking at the training logs. Here are runs with the same parameters for everything but target modules, for a general idea of what to expect:

Training 'llama' model using (gate, down, up, q, v, k, o) projections                                                                      
Trainable params: 125,173,760 (0.9525 %), All params: 13,141,201,920 (Model: 13,016,028,160)



Training 'llama' model using (q, v, k, o) projections
Trainable params: 52,428,800 (0.4012 %), All params: 13,068,456,960 (Model: 13,016,028,160)
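The trainable-parameter counts in those two logs can be reproduced from the LoRA adapter arithmetic: each targeted `d_out × d_in` matrix gains two low-rank factors, A (`r × d_in`) and B (`d_out × r`), i.e. `r · (d_in + d_out)` extra parameters. A quick sanity check, assuming LLaMA-13B dimensions (hidden 5120, intermediate 13824, 40 layers) and rank 32:

```python
# Sanity-check the trainable-parameter counts from the logs above.
# LoRA adds A (r x d_in) and B (d_out x r) per targeted matrix,
# i.e. r * (d_in + d_out) parameters. Dimensions assume LLaMA-13B, rank 32.
HIDDEN, INTERMEDIATE, LAYERS, RANK = 5120, 13824, 40, 32

def lora_params(shapes, r=RANK, layers=LAYERS):
    """Total adapter parameters for one set of (d_in, d_out) shapes per layer."""
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# q, k, v, o projections are all square (hidden -> hidden).
attn = lora_params([(HIDDEN, HIDDEN)] * 4)
# gate/up project hidden -> intermediate; down projects intermediate -> hidden.
mlp = lora_params([(HIDDEN, INTERMEDIATE), (HIDDEN, INTERMEDIATE),
                   (INTERMEDIATE, HIDDEN)])

print(f"{attn:,}")        # 52,428,800  -- matches the (q, v, k, o) log
print(f"{attn + mlp:,}")  # 125,173,760 -- matches the all-projections log
```

Both totals match the logged figures exactly, which also confirms the runs were at rank 32.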
initial tst push of added checkboxes to select target mods
f26ff42d
Organizing checkboxes in rows and setting default q,v for llama
0ebc9dbc
Not a fan of this, forces off by default even if user set env variabl…
c67a0db3
Organizing ui
598bce24
organizing ui
b20b1fd8
cleaning up code and reverting WANDB_MODE change I made(Although I do…
384e0204
cleaning up code and reverting WANDB_MODE change I made(Although I do…
2034ecd5
Adding info and note.
74847302
Fixed silly ui mistake
e67ece35
further clarification info
e3c73fbc
epolewski commented 1 year ago

I'm testing this out and it seems to be working. A few cosmetic things:
It breaks the refresh button look for me. The buttons still work:
image
On boot I see these errors. I was already getting one, but with this PR I get two. They don't seem to affect anything:
image
It does appear to be working though!

computerman00 Merge branch 'oobabooga:main' into target_mod_sel_ui
18a652ce
Fixing PEP code-style issue that was missed
a9cff107
computerman00 commented 1 year ago (edited)

> I'm testing this out and it seems to be working. Few cosmetic things: Breaks the refresh button look for me. The buttons still work: image On boot I see these errors. I was already getting one, but with this PR I get 2. They don't seem to affect anything: image It does appear to be working though!

Interesting, thanks for the feedback!

I have not been able to replicate this (neither the undefined buttons nor that second warning), so I will continue to investigate. I did just sync/merge oobabooga:main into the feature branch; I suspect there may have been some conflicts with newer dependencies/code that weren't present in my branch.

I would much appreciate any feedback if you do get to try again. Thanks again!

EDIT: Tested from a fresh Python 3.10 Miniconda environment with the default modules + requirements.txt, and everything worked as expected.

However, I did try testing with the newest requirements.txt + my old main branch and was indeed able to replicate the "undefined" buttons. You should be okay after pulling down the synced target_mod_sel_ui branch (or patching training.py with the diff).

computerman00 Merge branch 'oobabooga:main' into target_mod_sel_ui
7c0c9175
computerman00 Merge branch 'oobabooga:main' into target_mod_sel_ui
63ebad35
FartyPants commented 1 year ago (edited)

You are repeating the exact same thing that is already in the Training PRO extension, which is already supplied with the main distribution (even in the current version) and has been there for some time. Look in Advanced settings. The whole idea is that the training tab would remain lean and mean, while Training PRO would allow for more messing around.

image

Also, I would caution: most people should use Q,V when dealing with fine-tuned models.

Using all attention layers requires a higher-quality and larger training set than most people are willing to put up with, and the model may then deteriorate or become too stiff if used on a previously fine-tuned model.

You are quoting the QLora paper, but you forgot to note the size and quality of the Guanaco dataset, and that it was used on a base model. This isn't what most people do with QLora. It isn't necessarily what LORA was created for either. Q and V were suggested not because the authors were somehow oblivious to the fact that there are 7 projections, but because the people who suggested it found it the best performing.

Guanaco was a proof of the QLora concept: if you use all projections and a high-quality dataset, you can get away with QLora instead of fine-tuning the base model (and it's true to some extent). But I'm not aware of many other people doing QLORA to fine-tune a base model on a dataset of that size versus doing a full fine-tune (not LORA). At least not the models people use - those are all full fine-tunes. It would be counterproductive, because you want the best quality if you are bothering with all that.

Take it from someone who has done 500+ LoRAs (I tried everything, and now most of the time use q,v).

The original LORA paper touches on it while determining that "Adapting both Wq and Wv gives the best performance overall". The data they presented suggest that the higher the rank, the less difference choosing more projections makes, and so more targets can be used to lower the rank (for example, a rank-64 QV would equal a rank-8 QKVO). This is what advanced users can keep in mind when budgeting GPU.

image

These are very important distinctions.

For people who may ever read my unhinged chatter, here is my suggestion (based on experience):
Don't use more targets at the cost of lowering other parameters - you'll be doing it WRONG.
If you add additional projections and your GPU doesn't OOM, that only means your initial parameters did not fully utilize your training setup. You could have increased the batch size or frame size (cutoff length) instead, which will increase quality far more dramatically than adding more projections.
As the data suggest, you may use more targets in order to be able to lower the rank. For example, say you want rank 256, but your GPU OOMs. You may add QKVO and then lower the rank to 64 or 128, which may provide the same overall accuracy, but won't OOM, and you don't need to lower the batch size or make the frames smaller, which would far more significantly affect the model.
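The budgeting trade-off above is easy to quantify: for square attention projections, adapter size scales with rank × number of targets, so doubling the targets while halving the rank keeps the adapter (and its optimizer state) the same size, and dropping further actually saves memory. A rough sketch, assuming LLaMA-13B attention dimensions:

```python
# Rough adapter-size arithmetic for the rank-vs-targets trade-off.
# For square (hidden -> hidden) attention projections, each target adds
# rank * 2 * hidden parameters per layer, so size ~ rank * n_targets.
HIDDEN, LAYERS = 5120, 40  # LLaMA-13B attention dimensions (assumed)

def adapter_params(rank, n_targets):
    return LAYERS * n_targets * rank * 2 * HIDDEN

qv_r256 = adapter_params(rank=256, n_targets=2)    # Q,V at rank 256
qkvo_r128 = adapter_params(rank=128, n_targets=4)  # Q,K,V,O at rank 128
qkvo_r64 = adapter_params(rank=64, n_targets=4)    # Q,K,V,O at rank 64

print(qv_r256 == qkvo_r128)   # True: same adapter size, won't help an OOM
print(qkvo_r64 * 2 == qv_r256)  # True: rank 64 halves the adapter
```

So dropping from rank-256 QV to rank-128 QKVO keeps the adapter parameter count identical, while rank-64 QKVO halves it; whether the accuracy holds up is the empirical claim above, not something this arithmetic shows.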

oobabooga commented 1 year ago

The Training PRO extension and the main training tab are independent. It's fine to update the main one with a missing feature, especially when the additions are tidy, as is the case here.

The PR seems to have been tested and working, so let's merge it. Thank you @computerman00!

oobabooga merged 4405513c into main 1 year ago
