feat: Add Motif-Video model and pipelines (#13551)
* feat: add Motif Video T2V and I2V pipelines with AdaptiveProjectedGuidance support
Add complete Motif Video implementation to diffusers:
New Models:
- Add MotifVideoTransformer3DModel with T5Gemma2Encoder for multimodal conditioning
- Supports text-to-video and image-to-video generation with vision tower integration
New Pipelines:
- Add MotifVideoPipeline for text-to-video generation
  - Default resolution: 736x1280, 121 frames, 25 fps
  - Supports classifier-free guidance and AdaptiveProjectedGuidance
- Add MotifVideoImage2VideoPipeline for image-to-video generation
  - First frame conditioning with vision encoder
  - Same defaults as T2V pipeline
Enhanced Guidance:
- Update AdaptiveProjectedGuidance with normalization_dims parameter
  - Support "spatial" normalization for 5D tensors (per-frame spatial normalization)
  - Support custom dimension lists for flexible normalization
- Update AdaptiveProjectedMixGuidance with same parameter
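The effect of normalization_dims can be illustrated with a small, dependency-free sketch. It is not the diffusers implementation: the helper names `resolve_normalization_dims` and `norm_shape` are hypothetical, and two points are assumptions: that "spatial" reduces over the last two axes (H, W) of a [B, C, F, H, W] tensor, and that the default reduces over every non-batch axis. The sketch only computes the shape of the resulting norm tensor, which shows how many independent norms each setting produces.

```python
from typing import List, Optional, Sequence, Tuple, Union

DimsSpec = Optional[Union[str, Sequence[int]]]


def resolve_normalization_dims(ndim: int, normalization_dims: DimsSpec = None) -> List[int]:
    """Hypothetical helper: turn a normalization_dims setting into axis indices."""
    if normalization_dims is None:
        # Assumption: the default reduces over every axis except the batch axis.
        dims = list(range(1, ndim))
    elif isinstance(normalization_dims, str):
        if normalization_dims != "spatial":
            raise ValueError(f"unknown normalization_dims: {normalization_dims!r}")
        if ndim != 5:
            raise ValueError('"spatial" expects a 5D tensor [B, C, F, H, W]')
        # Assumption: "spatial" means the height and width axes only.
        dims = [-2, -1]
    else:
        # A custom list of axes is used as given.
        dims = list(normalization_dims)

    for d in dims:
        if not -ndim <= d < ndim:
            raise ValueError(f"dimension {d} is out of range for a {ndim}D tensor")
    # Convert negative indices to positive ones and drop duplicates.
    return sorted({d % ndim for d in dims})


def norm_shape(shape: Sequence[int], normalization_dims: DimsSpec = None) -> Tuple[int, ...]:
    """Shape of the norm tensor when reducing over the resolved axes with keepdim=True."""
    dims = resolve_normalization_dims(len(shape), normalization_dims)
    return tuple(1 if i in dims else size for i, size in enumerate(shape))


latent_shape = (2, 16, 31, 46, 80)  # illustrative [B, C, F, H, W] sizes, not from the PR
print(norm_shape(latent_shape))              # one norm per sample
print(norm_shape(latent_shape, "spatial"))   # one norm per sample, channel, and frame
print(norm_shape(latent_shape, [1, 3, 4]))   # one norm per sample and frame
```

Under these assumptions, the default yields a single norm per sample, while "spatial" and the custom list keep the frame axis, so each frame is rescaled independently.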
Documentation & Tests:
- Add comprehensive API documentation for transformer and pipelines
- Add test suites for both T2V and I2V pipelines
- Register all new components in __init__ files
- Add dummy objects for torch and transformers backends
Total: 18 files changed, 3416 insertions(+), 2 deletions(-)
* Remove linear quadratic
* Remove musicldm
* Update docstring
* Address vision_encoder comment
* Add copy source in I2V pipeline
* Refactor _get_prompt_embeds
Co-authored-by: Beomgyu Kim <beomgyu.kim@motiftech.io>
* Fix a typo
* Refactor MotifVideo transformer to use diffusers Attention conventions
- Use default Attention class with custom MotifVideoAttnProcessor2_0
- Inline cross-attention in transformer blocks
- Use dispatch_attention_fn for backend support
- Inherit AttentionMixin for attn_processors/set_attn_processor
- Move TransformerBlockRegistry to _helpers.py
- Add _repeated_blocks for regional compilation
* Use base classes for scheduler and guider
* Implement MotifVideoAttention
* Update style and quality
* Fix a typo
* Fix a typo
* Fix a typo
* Update year
* Address rope dtype
* Update docstring and remove frame_rate
* Address unused sigmas
* Add available processors
* Address copy from comment
* Remove torch.no_grad()
* Remove use_attention_mask
* Address inline cross-attention
* Address compute dtype
* Remove unused variables
* Merge main APG into this branch and update documentation
* Refactor cross attention processor
* Remove unused timestep
* Inline create_attention_mask
* Make guider required
* Address encode_prompt comment
* Address preprocess_video comment
* Use T5Gemma2Encoder in test cases
* Address None feature_extractor
* Address output type
* Re-enable skipped tests
* Update style and quality
* Generate standard transformer test case
* Add model test case
* Remove guider in documentation
* Implement cross_attn layer
* Remove prepare_negative_prompt
* Address latent is None
* Clean up feature_extractor
* Fix prepare_latents
* Remove transformers assertion
* Fix style and quality
* Apply fixes from python utils/check_copies.py --fix_and_overwrite and
python utils/check_dummies.py --fix_and_overwrite
* Add dropout rate to text config
* Skip tests requiring guidance_scale
* Fix encode_prompt in test cases
* Fix test_cpu_offload_forward_pass_twice
* Update tests/pipelines/motif_video/test_motif_video.py
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update tests/pipelines/motif_video/test_motif_video.py
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update tests/pipelines/motif_video/test_motif_video.py
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update tests/pipelines/motif_video/test_motif_video_image2video.py
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Address test_attention_slicing_forward_pass comment
* Update tests/pipelines/motif_video/test_motif_video_image2video.py
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update tests/pipelines/motif_video/test_motif_video_image2video.py
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update tests/pipelines/motif_video/test_motif_video_image2video.py
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Skip I2V test cases
* Fix style and quality
* Add docs to toctree
* Fix docs location in toctree and add link in overview
* Inline gradient checkpointing
* Add _keep_in_fp32_modules for timestep_embedder
* Address num_decoder_layers comment
* Address guider is not None comment
* Remove _keep_in_fp32_modules
* Address parameter_dtype comment
---------
Co-authored-by: Ken Cheung <ken.cheung@motiftech.io>
Co-authored-by: Beomgyu Kim <beomgyu.kim@motiftech.io>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>