peft
e2822051 - Fix various test errors in the single GPU case (#3031)

Fix various test errors in the single GPU case (#3031)

This addresses some of the errors reported when running the tests on a single GPU machine. I will list the error messages along with a short explanation of each fix.

> `FAILED tests/test_common_gpu.py::PeftGPUCommonTests::test_lora_gptq_quantization_from_pretrained_safetensors - NameError: name 'BACKEND' is not defined`

The test was using GPTQModel without being marked as requiring it, leading to an error. This is fixed by marking the test with `requires_gptqmodel` (a minimal sketch of such a marker follows after the message).

> `FAILED tests/test_custom_models.py::TestPeftCustomModel::test_only_params_are_updated[Embedding + transformers Conv1D 1 trainable_tokens-EmbConv1D-TrainableTokensConfig-config_kwargs180] - AssertionError: assert not True`
> `FAILED tests/test_custom_models.py::TestPeftCustomModel::test_disable_adapters_with_merging[Embedding + transformers Conv1D 1 trainable_tokens-EmbConv1D-TrainableTokensConfig-config_kwargs180] - AssertionError: assert not True`

These tests fail because the gradients of the trainable token deltas are sometimes 0, but only when training on CUDA; CPU is fine. This is a weird one and I'm not sure whether this is a good fix. I encountered the error on two machines (1xL40S and 4xA10G) and was not able to pin it down to anything particular in the environment, i.e. PEFT version (tested v0.17 to main), transformers version (tested 4.5{5,6,7}, 5.0), CUDA version (tested 12.6, 12.8), or torch version (tested 2.7, 2.8, 2.9, 2.10). I also set `LD_LIBRARY_PATH=` before running pytest to exclude cuDNN libraries that come preinstalled on the EC2 instance. Removing the ReLU in `EmbConv1DModel` as well as boosting the Conv1D weights fixes the error; replacing the ReLU with `Threshold(0, 0)` shows the same behavior. It depends on the seed, i.e. if the initialization of `Conv1D` is favorable the bug does not trigger (an illustrative sketch of how a ReLU can zero out upstream gradients follows after the message). I tried pinpointing it on `index_copy`, but `index_copy` by itself is not the problem. Maybe we will just have to live with this?

> `FAILED tests/test_common_gpu.py::PeftGPUCommonTests::test_dora_ephemeral_gpu_offload_multigpu - RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)`

This is caused by a bug introduced in #2960: `ephemeral_gpu_offload` is not passed to the variant and is therefore never used.

> `FAILED tests/test_gpu_examples.py::PeftBnbGPUExampleTests::test_seq2seq_lm_training_single_gpu - AttributeError: 'T5ForConditionalGeneration' object has no attribute 'hf_device_map'`

This is caused by transformers@315dcbe45cee1489a32fc228a80502b0a150936c, which disables accelerate hooks if the device map only contains one device. I confirmed that specifying just one device moves the model to that device even without an accelerate hook being invoked. I also tested having two devices (cpu + cuda:0), and in that case a device map is present. Therefore this only needs an added `hasattr` check to be compatible with transformers v5 (a hedged sketch follows after the message).

Co-authored-by: nemo <git@ningu.net>
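For reference, here is a minimal sketch of what a `requires_gptqmodel`-style skip marker typically looks like; the helper name and its exact location in PEFT's test utilities are assumptions here, not a copy of the actual code.

```python
import importlib.util
import unittest


def requires_gptqmodel(test_case):
    """Skip the decorated test unless the gptqmodel package is installed."""
    return unittest.skipUnless(
        importlib.util.find_spec("gptqmodel") is not None,
        "test requires gptqmodel",
    )(test_case)
```

Decorating `test_lora_gptq_quantization_from_pretrained_safetensors` with such a marker makes the test skip on machines without GPTQModel instead of failing with a `NameError`.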
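To illustrate the suspected mechanism behind the zero gradients (this is not a reproduction of the CUDA-only failure itself): when an unfavorable initialization drives every pre-activation negative, the ReLU outputs zeros and the upstream parameters receive no gradient. The shapes and values below are made up for the demonstration.

```python
import torch

# All-positive inputs combined with all-negative weights push every
# pre-activation below zero, so the ReLU output is identically zero.
x = torch.rand(4, 8) + 0.1                              # strictly positive inputs
weight = torch.nn.Parameter(torch.full((8, 8), -1.0))   # unfavorable init

out = torch.relu(x @ weight).sum()
out.backward()

# The ReLU blocks the gradient entirely; with a favorable initialization
# (or without the ReLU) weight.grad would be non-zero.
print(weight.grad.abs().max())  # tensor(0.)
```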
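Finally, a hedged sketch of the kind of `hasattr` guard described for the `hf_device_map` failure; the function and variable names are illustrative, not the exact PEFT test code.

```python
def get_device_map(model):
    # Newer transformers no longer attaches accelerate hooks (and hence no
    # hf_device_map) when the device map resolves to a single device, so the
    # attribute access has to be guarded.
    if hasattr(model, "hf_device_map"):
        return model.hf_device_map
    # Single-device case: the model was simply moved to that device.
    return {"": next(model.parameters()).device}
```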