Thanks a lot for this PR. I only skimmed it for now and have yet to read the details from the paper but this looks quite promising!
When looking at the implementation, I saw that the way the Eva weights are initialized sits a bit "outside" of how we normally do this in PEFT. The normal way is to have this on the layer level, check for instance here:
peft/src/peft/tuners/lora/layer.py, lines 128 to 139 (commit 5758a7e)
My guess why this is implemented differently here is that there is currently one pass through the modules that determines the individual ranks and creates the state dict. If this was refactored such that the weights were initialized on the layer level as usual, we would need two passes, one to adjust the ranks in the config and then the actual weight initialization on the layer level. Would it be possible to refactor the code to enable that?
My main concern with the current approach is that the weight initialization currently is happening in one place of the code but now there would be another place, making the code more disorganized. Second, IIUC, we would initialize the weights once and then override them with Eva parameters, which is wasteful. Finally, if we do this on the layer level, that could solve the multi-GPU issue, right?
Apart from this, what I noticed while skimming is that there is a big class IncrementalPCA which looks like a rewrite of the equivalent sklearn class using torch. My question is: Did you implement this from scratch or is it taken from somewhere? This has implications on how well this is tested etc.
I also had difficulties understanding the SVDHook and HashHook, could you please add a small docstring? Are these hooks permanently added?
Regarding torch.compile, we could add a test for that and see if it works. But this is not a necessity for this PR to land in PEFT, just something that's nice to have.
Going forward, we'll also have to add a few unit tests, but I'm fine with iterating on the implementation first and then adding the tests.
Thanks a lot for the quick feedback @BenjaminBossan !
I added docstrings to SVDHook and HashHook, hope this makes their purpose a bit more clear.
Regarding your other concerns about the implementation:
I agree that it's not ideal that our proposed init method is not on the layer level like the other inits. I thought a lot about where in the code to put the init and unfortunately I think it would be quite difficult to put the init directly in the layer (or at least I haven't found an elegant way to do it). The way eva is initialized is that we do multiple forward passes through the model and incrementally perform SVD on the input activations of the target layers (usually more than 10 forward passes). The resulting singular vectors are used to initialize the Lora A matrices. So unfortunately 2 passes would also not be enough. We would also need to somehow pass an instance of EvaConfig to the LoraLayer.
What we could maybe do is create the eva_state_dict as it is created now and then somehow pass the state dict to LoraModel._create_and_replace and then the relevant tensors to the update_layer methods of individual layers.
I did also add a statement in the update_layer method to initialize the Lora B matrices with zeros as needed for eva. For this reason we do not initialize weights twice, as the A matrices are not initialized in the update_layer method.
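To illustrate the flow described above, a rough sketch (this is not the actual implementation; apart from the standard register_forward_hook, treat the names and attributes as placeholders, and SVDHook, while it exists in the PR, may have a different interface):

model = ...                 # base model
lora_config = ...           # LoraConfig with init_lora_weights="eva"
eva_config = ...
data_iter = ...             # iterator over finetuning batches
num_forward_passes = 10     # usually more than 10 passes are needed

hooks = {}
for name, module in model.named_modules():
    if name in lora_config.target_modules:
        hook = SVDHook(eva_config)
        module.register_forward_hook(hook)   # the hook sees the layer's input activations
        hooks[name] = hook

for _ in range(num_forward_passes):
    model(**next(data_iter))                 # each pass updates the incremental SVD inside the hooks

# the collected singular vectors later become the lora_A weights (ranks can differ per layer)
eva_state_dict = {name: hook.components for name, hook in hooks.items()}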
Regarding the new eva code: IncrementalPCA is indeed largely taken from the sklearn class, so a lot of the code is directly copied from there. What we added is seamless gpu support by setting the device for all internal variables / buffers to the device of the first input tensor. We also integrated support for torch.svd_lowrank to significantly speed up computations. We ran several tests internally to check that the results are equivalent to running the sklearn implementation on cpu.
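To give an idea of where the speed-up comes from, a minimal, self-contained illustration of a truncated SVD with torch.svd_lowrank (the sizes and the q/niter values are arbitrary examples, not the settings used in the PR):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
activations = torch.randn(4096, 1024, device=device)   # stand-in for stacked input activations of one layer
rank = 16
# only the top-`rank` components are computed instead of a full SVD
U, S, V = torch.svd_lowrank(activations, q=rank, niter=2)
components = V.T                                        # shape (rank, 1024)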
Thanks for the further explanations. Let me discuss the points separately.
I see the issue with the cycling through the data. This is indeed tricky but I think it's worth putting some time into considering alternative solutions.
Right now, the dataloader is added to the config. We should really try to avoid this. The config should be a small, json-serializable object. This means it can't really contain any reference to the data. Moreover, there are so many unknowns about the data and how the model is called that I don't think we should have code like:
inputs = {k: v.to(device) for k, v in next(iter(dataloader)).items() if k != "labels"}
model(**inputs)
Let me throw out an alternative: Let's say that we had a function called initialize_lora_eva_weights or something like that. The API would be:
# normal PEFT code
base_model = ...
lora_config = LoraConfig(...)
model = get_peft_model(base_model, lora_config)
# new eva code
eva_config = EvaConfig(...)
sample_data = ... # user needs to define this instead of relying on data loader
def forward_fn(model):
    return model(**sample_data)
model = initialize_lora_eva_weights(model, config=eva_config, forward_fn=forward_fn, adapter_name="default")
This would make it so that Eva is a function that is applied to the LoRA model after it was initialized. This avoids the issues I mentioned above.
The disadvantages I see with this approach:
Do you think this would work with the Eva algorithm or is there something I'm missing? LMK what you think.
I wonder if we could use some of the sklearn unit tests and add them here to ensure that the IncrementalPCA produces the correct results. That way, we have the safety of knowing it works as expected without too much effort of adding our own unit tests.
Of course, this is all a bit outside of scope for PEFT, but if there is no external package that provides the functionality, it would be okay to add it to PEFT.
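For illustration, a ported test could be shaped roughly like this (assuming the torch class keeps a sklearn-like fit / components_ API; the tolerances are arbitrary):

import numpy as np
import torch
from sklearn.decomposition import IncrementalPCA as SklearnIPCA

def test_matches_sklearn(TorchIncrementalPCA):
    X = np.random.RandomState(0).randn(200, 20).astype(np.float32)
    ref = SklearnIPCA(n_components=5, batch_size=50).fit(X)
    ours = TorchIncrementalPCA(n_components=5, batch_size=50).fit(torch.from_numpy(X))
    # components are only defined up to a sign flip, so compare absolute values
    np.testing.assert_allclose(
        np.abs(np.asarray(ours.components_)), np.abs(ref.components_), rtol=1e-3, atol=1e-4
    )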
I was wondering already if having the dataloader in the config is a good idea since you probably only want primitive types there. I think your suggested solution is a sensible one. I gave it some thought and don't really have a better suggestion following the restrictions we discussed.
A few questions for clarification:
- Would eva still be a valid value for init_lora_weights in LoraConfig? I am a bit worried about discoverability of eva if LoraConfig would then just be normal lora settings and someone needs to know about the function initialize_lora_eva_weights to use eva.
- In your suggested solution, would sample data not be a dataloader?
- One question about how we get the layer activations for SVD. In SVDHook we currently have it implemented like this:
try:
    states = input.detach()
except AttributeError:
    states = input[0].detach()
So for cases where the input is a tuple, we just take the first value. Not super important at this stage I guess but I just wanted to ask if you think we should add an argument so the user can define which inputs we should use.
It's a good idea to take some from sklearn. All tests for this class are in this file. Right now there is nothing like this available anywhere. I built my own package torch-incremental-pca but I just assumed you don't want any dependencies on small packages like this, so I just copied the class over here. I did however create a pull request to have this feature implemented in pytorch directly, so in the future we might be able to remove this from peft again.
I was wondering already if having the dataloader in the config is a good idea since you probably only want primitive types there. I think your suggested solution is a sensible one. I gave it some thought and don't really have a better suggestion following the restrictions we discussed.
Okay, then let's go with this general approach. Maybe we have some ideas for improvements along the way, then we can still pivot.
- Would eva still be a valid value for init_lora_weights in LoraConfig? I am a bit worried about discoverability of eva if LoraConfig would then just be normal lora settings and someone needs to know about the function initialize_lora_eva_weights to use eva.
Yes, let's keep it as an option. By itself, it wouldn't do anything, but it helps with discoverability. We could also consider setting some kind of flag and when users want to use the model with Eva init but before calling the function we discussed, the flag could be checked and raise an error/warning. Moreover, with this init option being set, we can signal to other packages that they need to call the function (e.g. trl could automatically call this in their SFTTrainer).
- In your suggested solution, would sample data not be a dataloader?
Exactly, I don't think we absolutely need a dataloader, do we? AFAICT, we only need it to get a batch of data. If it needs to be a different batch each time, we could do:
from itertools import cycle

iter_dl = iter(cycle(dataloader))  # optionally tqdm

def forward_fn(model):
    sample_data = next(iter_dl)
    return model(**sample_data)
Is it necessary to define a custom forward function for the eva init? Usually it should be the same inputs that the model receives during normal finetuning anyway. I did adapt the line you mentioned to filter out the label key from the inputs dict.
My reasoning for this is that there are too many possibilities for how the model could be called, we can't cover all of them. The assumption that we can just filter out labels and then call model(**inputs) might mostly work for standard HF LLM architectures, but this method should also work with other model types, right? Let's keep the possibilities open.
But thinking more about it, I'd say what we can do is to keep your existing API of passing the dataloader and give the option for a custom forward_fn on top. That way, there is a more convenient way for the standard models and a more flexible and advanced way for all the others.
- One question about how we get the layer activations for SVD.
I'd definitely rather have an isinstance check than a try ... except for this. But I'm not sure if that's what you were asking.
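For example, inside the hook (where torch is already available) the check could look roughly like this, just as a sketch:

if isinstance(input, torch.Tensor):
    states = input.detach()
elif isinstance(input, (tuple, list)):
    states = input[0].detach()
else:
    raise ValueError(f"unexpected input type for the SVD hook: {type(input)}")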
I just assumed you don't want any dependencies on small packages like this so I just copied the class over here.
That's correct. We could make it an optional dependency, as it is only needed for Eva. But we could also copy the code over to PEFT ("vendoring"), which is similar to what we did with BufferDict. What I would like to see: If you copy some of the sklearn tests to your package and ensure that they pass, we wouldn't need them in PEFT too, we could just say: "check this URL for the original code". This way, there is a bit less clutter.
I did however create a pull request to have this feature implemented in pytorch directly so in the future we might be able to remove this from peft again.
Nice. Of course, even if it lands, we can't use it immediately as we want to support older PyTorch versions. But definitely add a comment with a link to the PR. I would imagine, however, that PyTorch won't be too keen on the sklearn API, but let's see.
@BenjaminBossan before going ahead with this implementation of calling a function/method after get_peft_model is called, I just wanted to propose one alternative idea to get your thoughts on it.
Instead of using LoraModel we create a subclass EvaModel. This model class initially would not have any adapter layers but we would add the forward hooks needed for the SVD computations. In the __init__ we would have something like
class EvaModel(LoraModel):
    def __init__(self, model, config, adapter_name):
        self.model = model
        for layer in config.target_modules:
            module = get_module(self.model, layer)
            module.register_forward_hook(SVDHook(config))
During the first few forward passes of finetuning we would not actually update any weights but use the activations to compute the eva state dict. Once SVD has converged for all layers, we would switch the internal state of the model to finetuning (with some boolean variable) and at that point only add the adapters with the appropriate eva values for the lora A matrices.
Hope this makes sense. That way we could leverage the finetuning loop directly. It seems a bit more seamless and works without needing any additional function calls after get_peft_model. But of course there might be some hidden issues I am not considering at the moment.
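A very rough sketch of what that phase switch could look like (all names here are invented for illustration and this is not the PR implementation; the hooks are assumed to expose a converged flag and the adapter injection is hidden behind a callback):

import torch.nn as nn

class EvaPhaseWrapper(nn.Module):
    def __init__(self, model, hooks, inject_adapters_fn):
        super().__init__()
        self.model = model
        self.hooks = hooks                          # dict: layer name -> SVD hook with a .converged flag
        self.inject_adapters_fn = inject_adapters_fn
        self.svd_finished = False

    def forward(self, *args, **kwargs):
        out = self.model(*args, **kwargs)
        if not self.svd_finished and all(h.converged for h in self.hooks.values()):
            # switch from the SVD phase to regular finetuning
            self.inject_adapters_fn(self.model, self.hooks)
            self.svd_finished = True
        return out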
I like the proposal to seamlessly integrate this into the training loop. I'm not sure if we need a separate class for that, but you could come up with a prototype and I can check if there is a good way to integrate it without a separate class.
With this, we would need to make sure that it works with vanilla training loops but also frameworks like Trainer from transformers. If it does, I think this could be the optimal solution.
@BenjaminBossan I started implementing this approach of using the first few batches for the initialization and unfortunately I realised that there are at least 2 major issues with it:
These issues make me wonder if some variant of my initially proposed solution would be feasible (computing the eva state dict before initializing the adapters).
One idea I would have which avoids passing a dataloader or something similar through the config would be to enable the get_peft_model function to accept some optional keyword arguments through **kwargs and passing the dataloader as well as a model forward_fn this way.
def get_peft_model(
    model: PreTrainedModel,
    peft_config: PeftConfig,
    adapter_name: str = "default",
    mixed: bool = False,
    autocast_adapter_dtype: bool = True,
    revision: Optional[str] = None,
    **model_init_kwargs,
):
    ...
    PeftModel(model, peft_config, adapter_name=adapter_name, autocast_adapter_dtype=autocast_adapter_dtype, **model_init_kwargs)
    ...
peft_model = get_peft_model(model, peft_config, eva_dataloader=dataloader, eva_forward_fn=eva_forward_fn)
We also don't directly have to modify the __init__ method of LoraModel but could define a _pre_injection_hook for LoraModel, which should also do the trick.
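As a sketch of that idea, assuming the hook is called on the bare base model before any adapter layers are injected (compute_eva_state_dict below is only a placeholder for the actual eva computation):

from peft.tuners.lora import LoraModel

def compute_eva_state_dict(model, config):
    # placeholder for the forward passes + incremental SVD described earlier
    return {}

class EvaLoraModel(LoraModel):
    def _pre_injection_hook(self, model, config, adapter_name):
        if getattr(config, "init_lora_weights", None) == "eva":
            self.eva_state_dict = compute_eva_state_dict(model, config)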
Let me know what you think about the issues I mentioned with the previously proposed approaches. If my suggested approach is not feasible, I am fine with going ahead with the implementation suggested by you with an initialize_lora_eva_weights function (this will unfortunately also suffer from the downside mentioned above that already initialized lora weights will take up gpu memory that could potentially be useful for the svd computation, which we might want to run with a higher batch size than finetuning).
I started implementing this approach of using the first few batches for the initialization and unfortunately I realised that there are at least 2 major issues with it:
Good points. There would probably be ways to work around this or accept these shortfalls, but I agree that this is not ideal.
One idea I would have which avoids passing a dataloader or something similar through the config would be to enable the get_peft_model function to accept some optional keyword arguments through **kwargs and passing the dataloader as well as a model forward_fn this way.
I'm a bit reluctant to add more options to get_peft_model, especially when they're only used in a very narrow use case (e.g. this option only works for LoRA but get_peft_model is the entry point for all PEFT methods). Therefore, I tend to prefer the option of having a dedicated function.
To avoid initializing the LoRA weights through get_peft_model, I have a proposal. We recently added an option low_cpu_mem_usage=True to PEFT, which results in the PEFT weights being initialized on the meta device. This option has not been propagated to get_peft_model because until now, there was no reason to have it there. However, with Eva, it would be useful. So we could add the option there and then just pass it along to PeftModel in this line, which already accepts that argument. That way, we avoid initializing the LoRA weights. WDYT about that?
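In terms of usage, that could look roughly like this (a sketch assuming low_cpu_mem_usage gets added to get_peft_model as described, with initialize_lora_eva_weights as proposed earlier):

base_model = ...
forward_fn = ...   # as in the example above
eva_config = EvaConfig(...)
lora_config = LoraConfig(init_lora_weights="eva")   # plus the usual LoRA settings
# LoRA weights would be created on the meta device, i.e. not materialized yet
model = get_peft_model(base_model, lora_config, low_cpu_mem_usage=True)
# this call would then materialize the LoRA weights with the eva values
model = initialize_lora_eva_weights(model, config=eva_config, forward_fn=forward_fn, adapter_name="default")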
@BenjaminBossan thanks for your suggestions, I implemented a prototype solution now for the discussed approach. A few questions:
- I moved the IncrementalPCA class which has been adapted from sklearn to peft.utils.incremental_pca.py because I thought it is not strictly part of eva and in theory could be used for other things, does that make sense?
- Ideally users should be able to do from peft import initialize_lora_eva_weights. Would it be suitable to then put this function in peft.utils or should it be in peft.tuners.lora? I didn't see any other functions being pulled directly into the peft namespace from the tuners directory.
- I was wondering if we should use low_cpu_mem_usage=True as a default for eva. That way we wouldn't even need to have it as an argument for get_peft_model.
I implemented a prototype solution now for the discussed approach.
Thanks a lot. I think the API is decent and given the trade-offs we discussed, probably a good compromise.
- I moved the IncrementalPCA class which has been adapted from sklearn to peft.utils.incremental_pca.py because I thought it is not strictly part of eva and in theory could be used for other things, does that make sense?
Yes, that's fine. In the end, I think it doesn't really matter. What's important is that it is not in the peft namespace (i.e. from peft import IncrementalPCA should not be possible), apart from that it doesn't matter too much. We can always move it later.
- Ideally users should be able to do from peft import initialize_lora_eva_weights. Would it be suitable to then put this function in peft.utils or should it be in peft.tuners.lora? I didn't see any other functions being pulled directly into the peft namespace from the tuners directory.
We do pull objects from peft.tuners into the peft namespace, otherwise from peft import LoraConfig would not be possible :) As for initialize_lora_eva_weights: Right now, this only works for LoRA and it's not planned to change that, right? Therefore, I would keep it in the tuners/lora directory. There, import it into __init__, then from there into the __init__ from tuners/, and from there into the __init__ from peft.
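Spelled out, the re-export chain would look something like this (the eva module name inside tuners/lora is just an assumption):

# src/peft/tuners/lora/__init__.py
from .eva import initialize_lora_eva_weights

# src/peft/tuners/__init__.py
from .lora import LoraConfig, LoraModel, initialize_lora_eva_weights

# src/peft/__init__.py
from .tuners import LoraConfig, LoraModel, initialize_lora_eva_weights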
- I was wondering if we should use low_cpu_mem_usage=True as a default for eva. That way we wouldn't even need to have it as an argument for get_peft_model.
How exactly would you do that, given that the get_peft_model call is done by the user?
@BenjaminBossan thanks again for your feedback!
We do pull objects from peft.tuners into the peft namespace, otherwise from peft import LoraConfig would not be possible :)
I pulled initialize_lora_eva_weights and get_eva_state_dict into the peft namespace now. I asked because I only saw classes being pulled from tuners, no functions, so I was wondering if there is some rule for that.
How exactly would you do that, given that the get_peft_model call is done by the user?
I thought we could just have a check in the LoraModel.__init__ along the lines of:
low_cpu_mem_usage = init_lora_weights == "eva" or low_cpu_mem_usage
Regarding tests, I adapted some tests from sklearn for the incremental_pca class in tests/test_incremental_pca.py. Could you maybe advise me what the minimum test setup should look like for eva?
Thanks a lot for these updates to the PR.
Please always run make style after you make changes to the code so that the linter will pass.
I pulled initialize_lora_eva_weights and get_eva_state_dict into the peft namespace now. I asked because I only saw classes being pulled from tuners, no functions, so I was wondering if there is some rule for that.
Ah okay, no worries, there are no rules about that.
I thought we could just have a check in the LoraModel.__init__ along the lines of:
low_cpu_mem_usage = init_lora_weights == "eva" or low_cpu_mem_usage
I see. I think I'd rather not automatically switch this, maybe users have a specific reason why they need low_cpu_mem_usage=False despite using Eva. What you could do is give a warning if you find this, advising users that passing low_cpu_mem_usage=True is most likely what they want.
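For illustration, such a warning could be as simple as this (names, wording and placement are only a sketch):

import warnings

def warn_if_eva_without_low_cpu_mem_usage(init_lora_weights, low_cpu_mem_usage):
    if init_lora_weights == "eva" and not low_cpu_mem_usage:
        warnings.warn(
            "init_lora_weights='eva' is usually combined with low_cpu_mem_usage=True, "
            "otherwise the LoRA weights are first initialized normally and then overwritten."
        )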
Regarding tests, I adapted some tests from sklearn for the incremental_pca class in tests/test_incremental_pca.py.
I think what you have is more than enough, the test coverage for the class is very good. Let's just add a comment on where you adopted these tests from.
Could you maybe advise me what the minimum test setup should look like for eva?
This is a very good question. From my understanding, we can't really check the adjusted weights to say that after applying Eva, the weights should have this or that property. Just as an example, for PiSSA/OLoRA, we have tests that they reduce the quantization error. But I don't think we can have something similar here, right?
What we could do is just call initialize_lora_eva_weights with a couple of different options to ensure that there is no error being raised. This helps with the test coverage but it wouldn't really help with ensuring that Eva actually works as described in the paper.
Maybe we can expand the included example to replicate one of the findings from the paper. That way, we would have a high confidence that it works as promised, WDYT about that?
Alternatively, we could create small self-contained training runs in the test, using a small model and dataset. Then we would train the model once with and once without Eva. The test would be e.g. that with Eva, the model converges faster. This would be a more functional test but of course it has the danger of being flaky/seed dependent and could be slow to run.
@BenjaminBossan thanks for all your suggestions, I integrated all of it into the pull request.
Regarding testing, I added a new class TestEvaInitialization to tests_initialization.py. The tests basically run eva initialization three times for different seeds and check that the average cosine similarity between the same svd component from different seeds is above a certain threshold. This invariance to batch ordering is actually also one of the findings from our paper. I fear doing a finetuning run would probably take up too many resources / take too long for a test case.
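Conceptually, the check boils down to something like this (a simplified sketch; comparing only two seeds and the threshold value are illustrative):

import torch

def assert_seed_invariance(state_dict_a, state_dict_b, threshold=0.75):
    sims = []
    for key in state_dict_a:
        a, b = state_dict_a[key], state_dict_b[key]
        # compare matching SVD components row by row; sign flips don't matter
        sims.append(torch.nn.functional.cosine_similarity(a, b, dim=1).abs().mean())
    assert torch.stack(sims).mean() > threshold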
Overall I defined five test cases:
- the get_eva_state_dict function which returns the state_dict for eva. We use a non-peft model because testing peft models is covered with the next test
- initialize_lora_eva_weights with a LoRA model, which internally also calls get_eva_state_dict
- the __post_init__ method of EvaConfig for invalid settings
- LoraConfig for valid eva config settings
During testing I noticed a potential issue with how the loftq_config field is defined in LoraConfig. I added an issue and could also create a separate pull request for it.
Thanks a lot for all the updates. I did a more in-depth review now and found a few potential issues, please check my comments. Apart from those, could you please:
- run make style after all changes are done

    )
    module.lora_A[adapter_name].weight.copy_(w)
else:
    module.update_layer(
What would be a situation when this can occur? During the unit testing, it is not covered.
It's mostly because we don't support embedding and conv layers at the moment. So if adapters are added to those layers, we can't initialize them with eva, so the next best thing would be to do normal lora initialization by default. Do you think a separate test case is needed here?
if previous_components is not None:
    components = self.svd.components_
    if len(components.shape) == 1:
Could you add a short comment when this would happen?
done, it's only the case during the first svd step. Would you do it differently?
@BenjaminBossan I integrated all of your suggestions in the code, many thanks (for details I left comments to the separate conversations you opened about suggested changes).
Three general questions:
- Eva works out of the box with bitsandbytes. I included a gpu-dependent test in test_initialization with pytest.mark.skipif(not torch.cuda.is_available(), reason="test requires a GPU") to test bitsandbytes. Is that fine to have all EVA tests in one class in this script or would I need to move this test to test_common_gpu or test_gpu_examples?
- under examples/eva_finetuning/README.md I added sections using the [[autotest]] command to automatically use the docstrings of the eva related functions. Is that fine or does that not work for the examples section?
- To make using eva more seamless, do you think it's a good idea to just have to specify init_lora_weights="eva" and if EvaConfig is not defined / none it is initialized with the defaults (i.e. EvaConfig())?
EDIT:
I added three checks to initialize_lora_eva_weights, among them that model is a PeftModel. I also removed peft_config as an argument from the function since it's anyway an attribute of every PeftModel.
Thanks a lot for your excellent work. Not much is missing to be able to merge this.
Eva works out of the box with bitsandbytes. I included a gpu-dependent test in test_initialization with pytest.mark.skipif(not torch.cuda.is_available(), reason="test requires a GPU") to test bitsandbytes. Is that fine to have all EVA tests in one class in this script or would I need to move this test to test_common_gpu or test_gpu_examples?
For this test to be actually run on our nightly CI with GPUs, you'd have to move it, I think test_gpu_examples.py is the right place (just create a new test class there).
under examples/eva_finetuning/README.md I added sections using the [[autotest]] command to automatically use the docstrings of the eva related functions. Is that fine or does that not work for the examples section?
Hmm, not sure what you mean, I'm not familiar with autotest. Could you please elaborate?
To make using eva more seamless, do you think it's a good idea to just have to specify init_lora_weights="eva" and if EvaConfig is not defined / none it is initialized with the defaults (i.e. EvaConfig())?
I think it makes sense to do that if the default config works well enough out of the box. If we want the users to actively set the Eva parameters, it is better to make this step explicit. You can judge better than I can whether the default parameters should work well enough.
    module.update_layer(r=w.size(0), init_lora_weights=peft_config.init_lora_weights, **update_layer_kwargs)
    module.lora_A[adapter_name].weight.copy_(w)
else:
    module.update_layer(r=peft_config.r, init_lora_weights=True, **update_layer_kwargs)
    missing_eva_inits.append(name)
    print(name, type(module))
What would be a situation where this occurs? AFAICT it's not covered in tests. Also, let's remove the print statement.
Added a test for this: test_missing_eva_inits. Since EVA at the moment does not support embedding layers for example, we are doing standard lora init if such layers have adapters.
Thank you again for your suggestions @BenjaminBossan, I worked them all into the pull request.
- I moved the gpu-dependent test to test_gpu_examples.py. It was necessary to duplicate some code but I guess that's fine.
- I reworked get_indices_fn for customizing EVA. I made it a bit more generic and easier for users to understand. I renamed the callable argument in get_eva_state_dict and initialize_lora_eva_weights to prepare_model_inputs_fn now. It can be used to modify the model inputs and have access to the result in the SVD hooks. I adjusted the tests accordingly.
- I discussed the default values in EvaConfig with my coauthors. Some parameters like rho or user_label_mask have a setting that should work well for most applications but nevertheless it can be important to adjust the default settings. We were wondering if we should initialize a default EvaConfig if it's not specified but raise a warning explaining that it might be important to change the parameters in some cases. In either case I also updated the docstring of EvaConfig to mention this.
Thanks a lot for all the updates. The CI is a bit flaky at the moment, so don't worry about some failing test runs.
While reviewing the code, I found one major blocker that needs to be addressed before we can merge. The issue is that Eva redistributes the LoRA ranks, right? However, this is not reflected in the LoraConfig. Therefore, if a user saves the model and then tries to load it later, they'll get errors, e.g.:
size mismatch for base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight: copying a param with shape torch.Size([19, 2048]) from checkpoint, the shape in current model is torch.Size([16, 2048])
So unless I'm missing something, the solution for this would be to get the new ranks for all the LoRA weights and store them in a dict mapping the module name to the rank. This dict should be used to override the rank_pattern attribute of the LoraConfig. This should allow to successfully load the model. Could you please take a look at that and also extend the tests to cover this?
Regarding the alpha values, I'm not sure if those also need to be adjusted. E.g. if I have a LoRA rank of 16 and alpha of 32, the LoRA scaling would be 32/16. If the rank is changed to 19 but alpha stays at 32, the scaling would decrease to 32/19. Probably it's not a big deal, but I wanted to mention that.
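For illustration, the rank_pattern update could be done roughly like this (rank_pattern is the existing LoraConfig attribute, a dict mapping module names to ranks; the key format of the eva state dict is an assumption here):

def update_rank_pattern(peft_config, eva_state_dict):
    for name, w in eva_state_dict.items():
        if name.endswith("lora_A.weight"):
            module_name = name.rsplit(".lora_A.weight", 1)[0]
            peft_config.rank_pattern[module_name] = w.shape[0]   # new rank = number of kept components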
I also added a section for eva under package reference, please let me know if it's okay like this. I think it would be beneficial to find info about the EVA functions in the documentation.
Yes, that should be okay. Unfortunately, the doc preview appears to be broken at the moment, but if there is anything wrong with the rendering there, it'll be easy enough to fix later.
We were wondering if we should initialize a default EvaConfig if it's not specified but raise a warning explaining that it might be important to change the parameters in some cases.
From my point of view, the current solution is good.
@BenjaminBossan thanks for raising the issue with saving/loading eva models, I didn't think about this at first. It should be fixed now. I am updating the rank_pattern in init_lora_eva_weights.
I also found another potential issue that was not accounted for so far. It is possible that after eva initialization some adapter layers could get assigned rank 0. In those cases we would need to remove the adapter layer again. We would also need to update target_modules in the peft config. I also added test cases for this and the above issue. To test the rank=0 case it was necessary to move some functionality out of init_lora_eva_weights into a new function load_eva_state_dict. Otherwise it would not have been possible to modify the eva_state_dict (i.e. assigning one layer rank 0) in the test script. init_lora_eva_weights now just combines the two lower level functions get_eva_state_dict and load_eva_state_dict.
Regarding alpha values, it should be fine as is for now. For the experiments in our paper we did some initial tests to adjust alpha for different ranks after the redistribution but it didn't lead to an improvement in performance. I saw that the AdaLoRA implementation also did not adjust alpha for different ranks, is that right? I did however add a use_alpha_pattern option in EvaConfig now which is False by default. If this is set to True, alpha values will be adjusted after redistribution to reflect the new ranks.
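For reference, one way such an adjustment could keep the scaling lora_alpha / r constant after redistribution (whether this matches the PR implementation exactly is an assumption):

def adjusted_alpha(lora_alpha, old_rank, new_rank):
    # chosen so that adjusted_alpha / new_rank == lora_alpha / old_rank
    return lora_alpha * new_rank / old_rank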
Regarding docs, I built it locally and everything looks fine.
Thanks for your continued work. The adjustments for saving and loading look good. Also, nice catch with the rank 0 issue.
I have a couple of smaller comments, but the big picture looks good.
@BenjaminBossan thanks for your thoroughness, there were still some edge cases I didn't consider.
Besides the things you pointed out I updated some things additionally (sorry that there are still so many changes in the PR, but I keep discovering potential issues for users or ways to make the api simpler for users):
- I adjusted get_eva_state_dict for the case that LoraConfig contains a rank pattern. For these cases max_components can be different for different layers, so this variable is now a dict with keys being the adapter module names.
- I removed load_eva_state_dict and integrated the functionality into initialize_lora_eva_weights. This function now accepts an optional eva_state_dict argument.
One question: When loading the eva_state_dict I need to get the module name in the base model based off the module name in the peft model. Is this a valid way to do that:
name_in_base_model = name.replace("base_model.model.", "")
Thanks for your continued work on this PR, this is really excellent work.
Regarding your latest changes, they make sense to me, thanks for explaining. I did another check and flagged some smaller issues, mainly around making this approach more flexible for custom models. Otherwise, this PR is good to be merged.
One question: When loading the eva_state_dict I need to get the module name in the base model based off the module name in the peft model. Is this a valid way to do that:
Yes, I think the way it's implemented is good. For LoRA, we will always have this prefix.
@BenjaminBossan, thanks, I integrated your suggestions and left some comments in the open threads:
- when calling update_layer is needed
- None check for dataloader
Thanks for the updates, I think we're almost good to go. Just 2 small comments from my side.
when calling update_layer is needed
Sorry, I can't find the comment, could you please paste it again? Just to summarize, my main concern is that we may call module.update_layer even if it's not necessary, namely when the module was not initialized with low_cpu_mem_usage=True (i.e. when it is on a real device, not meta, we would not need to initialize it).
@BenjaminBossan this is the discussion about update_layer.
For the other two open conversations I added a comment directly to the conversation.
I replied to your latest comments. Also, there is a small merge conflict, could you please take care of it? Thanks.
Great, thanks for your feedback, I made some small changes. I think the only open issue is the potential circular import in constants.py.
Small note: I changed the default value of rho back to 1.0 as this works well in most settings.
@sirluk Some Eva tests are failing:
https://github.com/huggingface/peft/actions/runs/11698806717/job/32754904346?pr=2142#step:6:1195
Is it possible that you tested this with a GPU and there it's fast enough, but on the CPU CI runners, it's too slow? Maybe we should remove the timing assertions, as they tend to be flaky?
Thanks for pointing this out, I was indeed testing on a gpu machine so it was automatically running everything on gpu. I removed the speed test now.
Thanks a lot for ironing out the last kinks. The PR now LGTM. Really excellent work, thanks for that and for your patience.
I hope that many users will give it a try, as it looks quite promising.
Great, thanks for your great feedback throughout the process!
Great PR!
@sirluk I just wanted to make you aware of PR #2231 which introduces a (at least superficially) similar method to EVA, which also benefited from our discussion here to improve the API. The PR includes a small table that summarizes a couple of methods including EVA, maybe you can double check that it's accurate.
@BenjaminBossan, thanks for making us aware of this recent PR. It is indeed very relevant and interesting to us as well. I'm glad our discussion benefits the integration of other data-driven methods into PEFT. We believe the table is not entirely correct with respect to EVA. From their paper, "Knowledge Maintenance mode" in CorDA simply means using the smallest components for initialization, rather than the largest ones. This is necessary for methods like CorDA and PiSSA because they modify the base model weights, meaning the base model cannot be restored after fine-tuning. We intentionally followed the design of LoRA and left the base model untouched as this allows restoring the base model's weights and therefore full knowledge preservation. We just recently added a discussion paragraph on that in the updated version of our paper which will be on arXiv soon.
(tagging @paischer101 as he is the other joint first-author of EVA)
@sirluk Thanks for the reply. Somehow I didn't get a notification for your message or missed it, which is why I didn't respond earlier.
The CorDA PR is almost finished but there is still an opportunity to ask the authors to correct the table. From your comment, it was not quite clear to me what exactly was not entirely correct. Could you perhaps post directly on that PR and mention what could be improved?
@BenjaminBossan thanks for the response. Since EVA does not alter the base model weights, it is able to recover the base model and therefore allows full knowledge maintenance, i.e. the last column in the table should be a "yes" for EVA as well. I will make a post on the PR to clarify.
Hey @BenjaminBossan , just wanted to let you know there is a mistake in the latest peft release notes describing EVA. It says that we use "SVD of the base layer weights", which is not correct. EVA is a data-driven method that performs incremental SVD on minibatches of finetuning data. The result is used to initialize LoRA and perform an adaptive rank redistribution. Could you maybe change that so that potential users are not confused?
@sirluk Thanks for notifying us. I changed the release text accordingly.
Description
In our work "One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation" (paper) we introduce a new data-driven initialization method for LoRA adapters. We will present this work at ENLSP (oral) and AFM at NeurIPS'24.
Eva initializes LoRA in a data-driven manner based on information of the downstream data using SVD. It also adaptively allocates ranks throughout the model.
Implementation Details
- A new EvaConfig which at minimum needs to receive a value for the dataloader argument. Other parameters have default values which worked well in our experiments. The instantiated EvaConfig is passed as a parameter to LoraConfig (we took inspiration from the LoftQConfig implementation).
- In LoraConfig, "init_lora_weights" needs to be set to "eva".
- We adapted the __init__ method of LoraModel to create a state dict for eva initialization before initializing the parent class, to be able to populate the rank_pattern argument in LoraConfig. This way we directly initialize with a modified rank distribution, allowing us to directly load the eva_state_dict.
Open Questions
- We are not sure if self.model.load_state_dict(eva_state_dict, strict=False) is always a valid way to initialize, e.g. when the model is wrapped with torch.compile.
Example Usage
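A sketch reconstructing the usage implied by the description above (the API changed during the review, see the discussion; the argument names follow the description and may not match the final implementation):

from peft import EvaConfig, LoraConfig, get_peft_model

base_model = ...
dataloader = ...   # finetuning data used for the SVD passes

eva_config = EvaConfig(dataloader=dataloader)
lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],
    init_lora_weights="eva",
    eva_config=eva_config,
)
# with this setup, the eva initialization runs during model creation
model = get_peft_model(base_model, lora_config)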