I left a few comments, but all of them are very minor. Overall, this PR looks solid to me, and it shouldn't take much time to merge.
Off to @yiyixuxu.
```python
hidden_states = self.proj_out(hidden_states)

# 5. Unpatchify
# Note: we use `-1` instead of `channels`:
# - It is okay for CogVideoX-2b and CogVideoX-5b (the number of input channels equals the number of output channels)
# - However, for CogVideoX-5b-I2V, the number of input channels is twice the number of output channels
#   (the input image latents are concatenated along the channel dim)
```
I think this is sufficiently supplemented with a comment, it should be fine!
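To make the comment concrete, here is a minimal sketch of why `-1` works in the unpatchify reshape. All shapes below are illustrative, not the model's real dimensions: `proj_out` always produces `out_channels * patch * patch` features per token regardless of how many channels went *into* the transformer, so inferring the channel dim with `-1` is correct for both T2V and I2V.

```python
import torch

# Illustrative shapes only (not the real model config).
batch, num_frames, height, width, patch = 1, 2, 4, 4, 2
out_channels = 16  # channels produced by proj_out per patch position

# For I2V, the transformer *input* has 2x channels (image latents are
# concatenated), but the output of proj_out is always
# out_channels * patch * patch per token, so unpatchify must infer the
# channel count with `-1` rather than reuse the stored input `channels`.
tokens = (height // patch) * (width // patch)
hidden_states = torch.randn(batch, num_frames * tokens, out_channels * patch * patch)

# Unpatchify sketch (mirrors the general pattern, not the exact code):
p = patch
output = hidden_states.reshape(batch, num_frames, height // p, width // p, -1, p, p)
output = output.permute(0, 1, 4, 2, 5, 3, 6).flatten(5, 6).flatten(3, 4)
assert output.shape == (batch, num_frames, out_channels, height, width)
```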
```python
    image_rotary_emb=image_rotary_emb,
    return_dict=False,
)[0]
noise_pred = noise_pred.float()
```
This seems interesting. Why do we have to manually perform the upcasting here?
I think @yiyixuxu would be better able to answer this since it was copied over from the other Cog pipelines. IIRC, the original codebase had an upcast here too, which is why we kept it.
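For context, a tiny sketch of what the upcast does. The assumption (hedged, per the discussion above) is that it mirrors the original Cog codebase: with fp16 latents, the guidance arithmetic and scheduler step can accumulate rounding error, so the prediction is promoted to fp32 before those steps.

```python
import torch

# fp16 prediction as it comes out of the transformer (illustrative tensor)
noise_pred = torch.randn(1, 4, dtype=torch.float16)

# Upcast before guidance/scheduler math so subsequent arithmetic runs in fp32
noise_pred = noise_pred.float()
assert noise_pred.dtype == torch.float32
```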
thanks! left some minor comments, feel free to merge once addressed!
```python
latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents
latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
```
interesting, they don't add noise to the image
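A shape-level sketch of the observation: the clean image latents are duplicated for CFG and concatenated along the channel dim, without any noise added, which is also why the I2V transformer sees twice the input channels of the T2V model. Shapes here are illustrative stand-ins.

```python
import torch

# Illustrative [B, F, C, H, W] shapes, not the real latent sizes.
batch, frames, channels, h, w = 1, 2, 16, 4, 4
latents = torch.randn(batch, frames, channels, h, w)        # noisy video latents
image_latents = torch.randn(batch, frames, channels, h, w)  # clean image latents (no noise added)

do_classifier_free_guidance = True
latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents

# Conditioning is concatenated along the channel dim (dim=2 for [B, F, C, H, W]),
# doubling the transformer's input channels relative to T2V.
latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
assert latent_model_input.shape == (2 * batch, frames, 2 * channels, h, w)
```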
```python
"mixins.final_layer.norm_final": "norm_out.norm",
"mixins.final_layer.linear": "proj_out",
"mixins.final_layer.adaLN_modulation.1": "norm_out.linear",
"mixins.pos_embed.pos_embedding": "patch_embed.pos_embedding",  # Specific to CogVideoX-5b-I2V
```
Should we have any if/else to guard that accordingly?
This layer is actually absent in the T2V models. It's called `positional_embedding` in T2V, which is just a sincos PE, while it's `pos_embedding` here. I think it's safe, but I'm going to verify now.
Yep, this is safe and should not affect the T2V checkpoints since they follow different layer naming conventions
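A quick sketch of why the differing names make this safe. The `remap_key` helper and the single-entry map below are hypothetical stand-ins for the conversion script's renaming logic (assumed to be substring-based); the point is just that the T2V key `positional_embedding` never matches the I2V rule.

```python
# Hypothetical rename rule, modeled on the I2V-specific entry above.
rename_map = {
    "mixins.pos_embed.pos_embedding": "patch_embed.pos_embedding",  # I2V only
}

def remap_key(key: str) -> str:
    # Substring-based replacement (an assumption about the script's behavior).
    for old, new in rename_map.items():
        if old in key:
            key = key.replace(old, new)
    return key

# I2V learned-PE key gets remapped:
assert remap_key("mixins.pos_embed.pos_embedding") == "patch_embed.pos_embedding"
# T2V sincos-PE key is named differently and passes through untouched:
assert remap_key("mixins.pos_embed.positional_embedding") == "mixins.pos_embed.positional_embedding"
```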
```python
if self.use_positional_embeddings or self.use_learned_positional_embeddings:
    if self.use_learned_positional_embeddings and (self.sample_width != width or self.sample_height != height):
        raise ValueError(
            "It is currently not possible to generate videos at a resolution different from the default. This should only be the case with 'THUDM/CogVideoX-5b-I2V'."
        )
```
In other words, the 2b variant supports it?
Yes, we had some success with multi-resolution inference quality on the 2B T2V model. The reason for allowing this is to not confine LoRA training to 720x480 videos on the 2B model. 5B T2V will skip this entire branch. 5B I2V uses learned positional embeddings, so we can't generate them on-the-fly like the sincos embeddings for the 2B T2V model.
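To illustrate the distinction: sincos PEs can be computed for any token count at runtime, whereas a learned PE is a fixed-size parameter. The `sincos_pos_embed` helper below is a generic 1D sincos sketch (not the model's actual 2D/3D embedding code), just to show why one generalizes across resolutions and the other doesn't.

```python
import torch

def sincos_pos_embed(num_positions: int, dim: int) -> torch.Tensor:
    """Classic 1D sincos positional embedding, computable for any length."""
    position = torch.arange(num_positions, dtype=torch.float32)[:, None]
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / dim)
    )
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Sincos: any resolution / token count works on the fly (like 2B T2V).
assert sincos_pos_embed(128, 64).shape == (128, 64)
assert sincos_pos_embed(256, 64).shape == (256, 64)

# Learned: a fixed nn.Parameter sized at training time (like 5B I2V);
# more tokens than it was trained with cannot be generated without retraining.
learned_pe = torch.nn.Parameter(torch.zeros(128, 64))
assert learned_pe.shape[0] == 128
```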
```python
# perform guidance
if use_dynamic_cfg:
    self._guidance_scale = 1 + guidance_scale * (
        (1 - math.cos(math.pi * ((num_inference_steps - t.item()) / num_inference_steps) ** 5.0)) / 2
    )
```
(can revisit later)
This can introduce graph breaks because we are combining non-torch operations with torch tensors. `.item()` is a data-dependent call and can also lead to performance issues. Just noting so that we can revisit if needed.
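One possible direction for that revisit, sketched below: materialize the timesteps as Python floats once, before the loop, so the per-step `.item()` sync (and its graph break under `torch.compile`) disappears. The values here are illustrative stand-ins for `scheduler.timesteps`, not the real pipeline.

```python
import math
import torch

num_inference_steps, guidance_scale = 50, 6.0
timesteps = torch.linspace(999, 0, num_inference_steps)  # stand-in for scheduler.timesteps

# Single host->device sync outside the loop, instead of t.item() every step.
for t in timesteps.tolist():
    guidance = 1 + guidance_scale * (
        (1 - math.cos(math.pi * ((num_inference_steps - t) / num_inference_steps) ** 5.0)) / 2
    )

# At the final step (t == 0) the cosine schedule reaches full guidance.
assert abs(guidance - (1 + guidance_scale)) < 1e-6
```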
Looks good. My comments are minor, not blockers at all.
Will be merging after CI turns green. Will take up any changes in follow-up PRs
```
OSError: THUDM/CogVideoX-5b-I2V is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
```
The planned date for the model release is some time in the next few days, when the CogVideoX team is ready. Until then, we will be preparing a Diffusers patch release to ship the pipeline.
Thank you for your support! We expect to open-source the project next week. If the patch release can be published before then, it would be a great help to us.
The purpose of this PR is to adapt our upcoming CogVideoX-5B-I2V model to the diffusers framework. A new pipeline, `CogVideoXImage2Video`, has been created, and the documentation has been updated accordingly.

@a-r-r-o-w @zRzRzRzRzRzRzR