in the resize() function in image_transforms.py, the line 267: (#20728)
`image = to_channel_dimension_format(image, ChannelDimension.LAST)`
is redundant as this same conversion is also applied in to_pil_image().
This redundant call actually makes the training fail in rare cases.
The problem can be reproduced with the following code snippet:
```
from transformers.models.clip import CLIPFeatureExtractor
vision_processor = CLIPFeatureExtractor.from_pretrained('openai/clip-vit-large-patch14')
images = [
torch.rand(size=(3, 2, 10), dtype=torch.float),
torch.rand(size=(3, 10, 1), dtype=torch.float),
torch.rand(size=(3, 1, 10), dtype=torch.float)
]
for image in images:
processed_image = vision_processor(images=image, return_tensors="pt")['pixel_values']
print(processed_image.shape)
assert processed_image.shape == torch.Size([1, 3, 224, 224])
```
The last image has a height of 1 pixel.
The second call to to_channel_dimesion_format() will transpose the image, and the height
dimension is wrongly treated as the channels dimension afterwards.
Because of this, the following normalize() step will result in an
exception.