peft
8d8aa0b7 - Method comparison: LoRA that targets MLP modules (#2845)

The "LoRA Without Regret" blog post (https://thinkingmachines.ai/blog/lora/) mentions that targeting the MLP part of the transformer is more effective than targeting the attention modules. This experiment tests this by targeting ["gate_proj", "up_proj", "down_proj"] instead of the default layers (["q_proj", "v_proj"]). I chose the rank so that the parameter count matches what we get when targeting the attention modules with rank 32, which works out to rank 10.

Testing on my machine, there is indeed a nice improvement in the test score:

| metric               | target attention | target MLP |
|----------------------|------------------|------------|
| test accuracy        | 48.2%            | 51.3%      |
| # trainable params   | 9175040          | 9461760    |
| peak memory reserved | 20.74 GB         | 23.02 GB   |

There is, however, also a marked increase in memory usage despite the matched parameter count. Since the operations are different, this may not be a surprise, but let's wait for the final verdict once this experiment runs on our AWS instance.

Note: I also tested higher and lower ranks when targeting the MLP. The effect on memory usage was negligible, but higher ranks did improve the score:

| metric             | rank 8  | rank 10 | rank 12  | rank 32  |
|--------------------|---------|---------|----------|----------|
| test accuracy      | 50.3%   | 51.3%   | 52.2%    | 54.8%    |
| # trainable params | 7569408 | 9461760 | 11354112 | 30277632 |

In the end, I chose to add only the rank 10 experiment to match the number of trainable parameters.
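For reference, here is a minimal sketch of what this targeting change looks like with PEFT's LoraConfig. The base model and lora_alpha are assumptions for illustration, not taken verbatim from the method comparison scripts (a Llama-3.2-3B-sized architecture is consistent with the parameter counts in the tables above):

```python
# Minimal sketch: LoRA targeting the MLP projections instead of the default
# attention projections. Model name and lora_alpha are assumptions for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")  # assumed base model

config = LoraConfig(
    r=10,  # chosen so the trainable parameter count roughly matches rank 32 on ["q_proj", "v_proj"]
    lora_alpha=20,  # assumed; a common choice is 2 * r
    target_modules=["gate_proj", "up_proj", "down_proj"],  # MLP instead of attention
)

model = get_peft_model(model, config)
model.print_trainable_parameters()

# Rough parameter-count check for a Llama-3.2-3B-sized model (28 layers,
# hidden size 3072, KV size 1024, MLP intermediate size 8192):
#   attention targets, rank 32: 28 * 32 * ((3072 + 3072) + (3072 + 1024)) = 9,175,040
#   MLP targets,       rank 10: 28 * 10 * 3 * (3072 + 8192)               = 9,461,760
```

The parameter-count check in the comments reproduces the 9175040 vs. 9461760 figures from the first table.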