transformers
30d8919a - in the resize() function in image_transforms.py, the line 267: (#20728)

Commit

3 years ago

In the resize() function in image_transforms.py, line 267 (#20728):

`image = to_channel_dimension_format(image, ChannelDimension.LAST)`

is redundant, because the same conversion is also applied in to_pil_image(). This redundant call actually makes the training fail in rare cases. The problem can be reproduced with the following code snippet:

```
import torch
from transformers.models.clip import CLIPFeatureExtractor

vision_processor = CLIPFeatureExtractor.from_pretrained('openai/clip-vit-large-patch14')

# Three channels-first images, two of them with a dimension of size 1.
images = [
    torch.rand(size=(3, 2, 10), dtype=torch.float),
    torch.rand(size=(3, 10, 1), dtype=torch.float),
    torch.rand(size=(3, 1, 10), dtype=torch.float)
]

for image in images:
    processed_image = vision_processor(images=image, return_tensors="pt")['pixel_values']
    print(processed_image.shape)
    assert processed_image.shape == torch.Size([1, 3, 224, 224])
```

The last image has a height of 1 pixel. The second call to to_channel_dimension_format() transposes the image again, and afterwards the height dimension is wrongly treated as the channels dimension. Because of this, the following normalize() step raises an exception.
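The double conversion can be traced on shapes alone. The sketch below is a simplified stand-in, not the transformers implementation: `infer_channel_position` and `to_channels_last` are hypothetical helpers, and the rule "a first or last dimension of size 1 or 3 is the channel axis" is an assumption about how the channel position is inferred.

```python
# Hypothetical, simplified helpers that work on shape tuples only.
# Assumption: the channel axis is guessed from a first or last dimension of size 1 or 3.

def infer_channel_position(shape):
    """Guess whether a 3-D image shape is channels-first or channels-last."""
    if shape[0] in (1, 3):
        return "first"
    if shape[-1] in (1, 3):
        return "last"
    raise ValueError(f"cannot infer channel axis for shape {shape}")


def to_channels_last(shape):
    """Return the shape after converting to (height, width, channels)."""
    if infer_channel_position(shape) == "last":
        return shape  # already channels-last, nothing to do
    channels, height, width = shape
    return (height, width, channels)


# Apply the conversion twice, as the redundant call effectively did.
for original in [(3, 2, 10), (3, 10, 1), (3, 1, 10)]:
    once = to_channels_last(original)
    twice = to_channels_last(once)
    print(original, "->", once, "->", twice)
```

Under this assumption only the third shape changes on the second call: (3, 1, 10) becomes (1, 10, 3), whose leading 1 is then mistaken for a channel axis, giving (10, 3, 1). The other two shapes are stable after the first conversion, which is consistent with the failure appearing only for an image with a height of 1 pixel.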

References

#20728 - Redundant to_channel_dimension_format() call makes preprocessing fail in case the image has height of 1 pixel

#27720 - Add common processor tests

#29969 - [SigLIP] Add fast tokenizer

#32831 - [Docs] Update resources

#33111 - [Backbone] Remove out_features everywhere

#33174 - [Zero-shot image classification pipeline] Remove tokenizer_kwargs

#39821 - Support MetaCLIP 2

#59 - Fix attention mask handling in EoMT-DINOv3 converter

#41212 - Add EoMT with DINOv3 backbone

#62 - Add initial DEIMv2 model implementation

Author

dhansmair

Parents

4f1788b3
