[image-to-text pipeline] Add conditional text support + GIT (#23362)
* First draft
* Remove print statements
* Add conditional generation
* Add more tests
* Remove scripts
* Remove BLIP-specific lines
* Add support for pix2struct
* Add fast test
* Address comment
* Fix style