Add AudioLDM 2 (#4549)
* from audioldm
* unet down + mid
* vae, clap, flan-t5
* start sequence audio mae
* iterate on audioldm encoder
* finish encoder
* finish weight conversion
* text pre-processing
* gpt2 pre-processing
* fix projection model
* working
* unet equivalence
* finish in base
* add unet cond
* finish unet
* finish custom unet
* start clean-up
* revert base unet changes
* refactor pre-processing
* tests: from audioldm
* fix some tests
* more fixes
* iterate on tests
* make fix copies
* harden fast tests
* slow integration tests
* finish tests
* update checkpoint
* update copyright
* docs
* remove outdated method
* add docstring
* make style
* remove decode latents
* enable cpu offload
* (text_encoder_1, tokenizer_1) -> (text_encoder, tokenizer)
* more clean up
* more refactor
* build pr docs
* Update docs/source/en/api/pipelines/audioldm2.md
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* small clean
* tidy conversion
* update for large checkpoint
* generate -> generate_language_model
* full clap model
* shrink clap-audio in tests
* fix large integration test
* fix fast tests
* use generation config
* make style
* update docs
* finish docs
* finish doc
* update tests
* fix last test
* syntax
* finalise tests
* refactor projection model in prep for TTS
* fix fast tests
* style
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>