Method comparison: LoRA that targets MLP modules (#2845)
The "LoRA Without Regret" blog
post (https://thinkingmachines.ai/blog/lora/) mentions that targeting
the MLP part of the transformer is more effective than targeting the
attention modules. This experiment puts that claim to the test by targeting:
["gate_proj", "up_proj", "down_proj"]
instead of the default layers (["q_proj", "v_proj"]).
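For reference, this is roughly what the change looks like with PEFT's LoraConfig (a minimal sketch, not the full experiment config):

```python
from peft import LoraConfig

# Default-style config targeting the attention projections.
attn_config = LoraConfig(
    r=32,
    target_modules=["q_proj", "v_proj"],
)

# MLP-targeting config tested here; rank 10 roughly matches the
# trainable parameter count of rank 32 on the attention projections.
mlp_config = LoraConfig(
    r=10,
    target_modules=["gate_proj", "up_proj", "down_proj"],
)
```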
To match the parameter count we would get when targeting the attention
modules with rank 32, I chose rank 10 for the MLP targets (a quick check
of this match is sketched below the table). Testing on my machine, there
is indeed a nice improvement in the test score:
| metric | target attention | target MLP |
|----------------------|------------------|------------|
| test accuracy | 48.2% | 51.3% |
| # trainable params | 9175040 | 9461760 |
| peak memory reserved | 20.74 GB | 23.02 GB |
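To double-check that the two settings are matched in trainable parameters, a helper along these lines can be used; the model id below is just a placeholder, not the actual benchmark model:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

def count_trainable(model_id: str, config: LoraConfig) -> int:
    # Wrap the base model with the given LoRA config and count trainable params.
    base = AutoModelForCausalLM.from_pretrained(model_id)
    peft_model = get_peft_model(base, config)
    return sum(p.numel() for p in peft_model.parameters() if p.requires_grad)

model_id = "my-org/my-base-model"  # placeholder, not the benchmark model
attn = count_trainable(model_id, LoraConfig(r=32, target_modules=["q_proj", "v_proj"]))
mlp = count_trainable(model_id, LoraConfig(r=10, target_modules=["gate_proj", "up_proj", "down_proj"]))
print(f"attention r=32: {attn:,} | MLP r=10: {mlp:,}")
```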
There is, however, also a marked increase in memory usage, despite the
matched parameter count. Since the operations differ (the MLP projections
involve the model's larger intermediate dimension, so presumably more
activation memory is needed), this may not be a surprise, but let's wait
for the final verdict once this experiment runs on our AWS instance.
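For context, peak reserved memory can be read out via PyTorch's CUDA memory stats; the benchmark may record it differently, so this is only a sketch:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the LoRA fine-tuning / evaluation here ...
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"peak memory reserved: {peak_gb:.2f} GB")
```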
Note: I also tested higher and lower ranks when targeting the MLP. The
effect on memory usage was negligible, but increasing the rank did
improve the score:
| metric | rank 8 | rank 10 | rank 12 | rank 32 |
|--------------------|---------|---------|----------|----------|
| test accuracy | 50.3% | 51.3% | 52.2% | 54.8% |
| # trainable params | 7569408 | 9461760 | 11354112 | 30277632 |
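As a quick sanity check on these numbers: LoRA adds r * (d_in + d_out) parameters per targeted linear layer, so the count should grow linearly with the rank, which the table confirms:

```python
# Trainable parameter counts taken from the table above.
params_by_rank = {8: 7_569_408, 10: 9_461_760, 12: 11_354_112, 32: 30_277_632}
for r, n in params_by_rank.items():
    print(f"rank {r:2d}: {n / r:,.0f} params per unit of rank")  # 946,176 for every rank
```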
In the end, I only added the rank 10 experiment, so that the number of
trainable parameters matches the attention baseline.