Support transformers 4.43 (#1971)
* fix BetterTransformer bark test
* setup
* patch clip models for sd
* infer ORT model dtype property from input dtypes
* patch all clip variants
* device setter
* bigger model for now
* fix device attribution
* onnx opset for owlvit and owlv2
* model dtype
* revert
* use model part dtype instead
* no need for dtype with diffusion pipelines
* revert
* fix clip text model with projection not outputting hidden states
* whisper generation
* fix whisper, support cache_position, and use transformers whisper generation loop
* style
* create cache position for merged decoder and fix test for non-whisper speech-to-text
* typo
* conditioned cache position argument
* update whisper min transformers version
* compare whisper ort generation with transformers
* fix generation length for speech-to-text model type
* cache position in whisper only with dynamic axis decoder_sequence_length
* use minimal prepare_inputs_for_generation in ORTModelForSpeechSeq2Seq
* remove version restrictions on whisper
* comment
* fix
* simpler
---------
Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>