Fix CLIPOutput attentions not being returned (#43657)
* Fix CLIPOutput attentions not being returned
Fixes #43618
CLIPVisionTransformer and CLIPTextTransformer were not passing the encoder's
'hidden_states' and 'attentions' through to the returned
BaseModelOutputWithPooling.
This regression in v5 meant that users could no longer access attention
weights via:
    vision_outputs = model.vision_model(pixel_values, output_attentions=True)
    vision_outputs.attentions  # was None, should contain attention weights
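A fuller reproduction, as a minimal sketch assuming the public
"openai/clip-vit-base-patch32" checkpoint and a dummy image batch:

    import torch
    from transformers import CLIPModel

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch

    vision_outputs = model.vision_model(
        pixel_values,
        output_attentions=True,
        output_hidden_states=True,
    )

    # Before the fix both fields were None; after the fix each holds
    # one tensor per encoder layer.
    assert vision_outputs.attentions is not None
    assert vision_outputs.hidden_states is not None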
The fix adds the missing fields to the return statements of both
transformer classes, as sketched below.
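Illustratively, the corrected return in CLIPVisionTransformer.forward looks
roughly like the excerpt below (CLIPTextTransformer is analogous); the local
variable names are assumptions about the surrounding forward implementation:

    # Sketch of the return statement after the fix; last_hidden_state,
    # pooled_output and encoder_outputs are assumed local variables of forward.
    return BaseModelOutputWithPooling(
        last_hidden_state=last_hidden_state,
        pooler_output=pooled_output,
        hidden_states=encoder_outputs.hidden_states,  # previously dropped
        attentions=encoder_outputs.attentions,        # previously dropped
    )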
* Regenerate metaclip_2 modeling to include hidden_states and attentions