Add SVD (#5895)
* begin model
* finish blocks
* add_embedding
* addition_time_embed_dim
* use TimestepEmbedding
* fix temporal res block
* fix time_pos_embed
* fix add_embedding
* add conversion script
* fix model
* up
* add new resnet blocks
* make forward work
* return sample in original shape
* fix temb shape in TemporalResnetBlock
* add spatio temporal transformers
* add vae blocks
* fix blocks
* update
* update
* fix shapes in AlphaBlender and add time activation in res block
* use new blocks
* style
* fix temb shape
* fix SpatioTemporalResBlock
* reuse TemporalBasicTransformerBlock
* fix TemporalBasicTransformerBlock
* use TransformerSpatioTemporalModel
* fix TransformerSpatioTemporalModel
* fix time_context dim
* clean up
* make temb optional
* add blocks
* rename model
* update conversion script
* remove UNetMidBlockSpatioTemporal
* add in init
* remove unused arg
* remove unused arg
* remove more unused args
* up
* up
* check for None
* update vae
* update up/mid blocks for decoder
* begin pipeline
* adapt scheduler
* add guidance scalings
* fix norm eps in temporal transformers
* add temporal autoencoder
* make pipeline run
* fix frame decoding
* decode in float32
* decode n frames at a time
* pass decoding_t to decode_latents
* fix decode_latents
* vae encode/decode in fp32
* fix dtype in TransformerSpatioTemporalModel
* type image_latents same as image_embeddings
* allow using different eps in temporal block for video decoder
* fix default values in vae
* pass num frames in decode
* switch spatial to temporal for mixing in VAE
* fix num frames during split decoding
* cast alpha to sample dtype
* fix attention in MidBlockTemporalDecoder
* fix typo
* fix guidance_scales dtype
* fix missing activation in TemporalDecoder
* skip_post_quant_conv
* add vae conversion
* style
* take guidance scale as input
* up
* allow passing PIL to export_video
* accept fps as arg
* add pipeline and vae in init
* remove hack
* use AutoencoderKLTemporalDecoder
* don't scale image latents
* add unet tests
* clean up unet
* clean TransformerSpatioTemporalModel
* add slow svd test
* clean up
* make temb optional in Decoder mid block
* fix norm eps in TransformerSpatioTemporalModel
* clean up temp decoder
* clean up
* clean up
* use c_noise values for timesteps
* use math for log
* update
* fix copies
* doc
* upcast vae
* update forward pass for gradient checkpointing
* make added_time_ids a tensor
* up
* fix upcasting
* remove post quant conv
* add _resize_with_antialiasing
* fix _compute_padding
* cleanup model
* more cleanup
* more cleanup
* more cleanup
* remove freeu
* remove attn slice
* small clean
* up
* up
* remove extra step kwargs
* remove eta
* remove dropout
* remove callback
* remove merge factor args
* clean
* clean up
* move to dedicated folder
* remove attention_head_dim
* docstr and small fix
* update unet doc strings
* rename decoding_t
* correct linting
* store c_skip and c_out
* cleanup
* clean TemporalResnetBlock
* more cleanup
* clean up vae
* clean up
* begin doc
* more cleanup
* up
* up
* doc
* Improve
* better naming
* better naming
* better naming
* better naming
* better naming
* better naming
* better naming
* better naming
* Apply suggestions from code review
* Default chunk size to None
* add example
* Better
* Apply suggestions from code review
* update doc
* Update src/diffusers/pipelines/stable_diffusion_video/pipeline_stable_diffusion_video.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* style
* Get torch compile working
* up
* rename
* fix doc
* add chunking
* torch compile
* torch compile
* add modelling outputs
* torch compile
* Improve chunking
* Apply suggestions from code review
* Update docs/source/en/using-diffusers/svd.md
* Close diff tag
* remove slicing
* resnet docstr
* add docstr in resnet
* rename
* Apply suggestions from code review
* update tests
* Fix output type latents
* fix more
* fix more
* Update docs/source/en/using-diffusers/svd.md
* fix more
* add pipeline tests
* remove unused arg
* clean up
* make sure get_scaling receives tensors
* fix euler scheduler
* fix get_scalings
* simplify euler for now
* remove old test file
* use randn_tensor to create noise
* fix device for rand tensor
* increase expected_max_difference
* fix test_inference_batch_single_identical
* actually fix test_inference_batch_single_identical
* disable test_save_load_float16
* skip test_float16_inference
* skip test_inference_batch_single_identical
* fix test_xformers_attention_forwardGenerator_pass
* Apply suggestions from code review
* update StableVideoDiffusionPipelineSlowTests
* update image
* add diffusers example
* fix more
---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: apolinário <joaopaulo.passos@gmail.com>