diffusers
CogVideoX-5b-I2V support
#9418
Merged


zRzRzRzRzRzRzR commented 349 days ago (edited 347 days ago)

The purpose of this PR is to adapt our upcoming CogVideoX-5B-I2V model to the diffusers framework:

  1. The model takes an image and text as input and outputs a video.
  2. The transformer's input channels have been increased to 32 (the image latents are concatenated with the video latents), while the rest of the model structure is similar to the 5B T2V model.
  3. A new pipeline, CogVideoXImageToVideoPipeline, has been created, and the documentation has been updated accordingly. A minimal usage sketch follows below.

@a-r-r-o-w @zRzRzRzRzRzRzR
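For reference, a minimal usage sketch of the new pipeline (assuming the final class name `CogVideoXImageToVideoPipeline` and the `THUDM/CogVideoX-5b-I2V` checkpoint id referenced later in this thread):

```python
import torch

from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# The checkpoint was not yet public at review time; the id below is the one the PR targets.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "An astronaut hatching from an egg, on the surface of the moon."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)

# use_dynamic_cfg=True follows the authors' recommendation discussed below.
video = pipe(image=image, prompt=prompt, use_dynamic_cfg=True).frames[0]
export_to_video(video, "output.mp4", fps=8)
```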

zRzRzRzRzRzRzR draft Init
6e3ae045
zRzRzRzRzRzRzR draft
ad78738a
zRzRzRzRzRzRzR vae encode image
8966671c
zRzRzRzRzRzRzR Merge branch 'huggingface:main' into cogvideox-5b-i2v
a56c5106
a-r-r-o-w make style
c238fe28
a-r-r-o-w image latents preparation
c1f7a800
a-r-r-o-w remove image encoder from conversion script
3df95b2c
a-r-r-o-w fix minor bugs
677a5530
a-r-r-o-w make pipeline work
4f518298
a-r-r-o-w make style
33c7cd6b
a-r-r-o-w remove debug prints
bc07f9f0
a-r-r-o-w fix imports
98f10238
a-r-r-o-w update example
aa12e1b5
a-r-r-o-w make fix-copies
1970f4fa
a-r-r-o-w add fast tests
e044850c
a-r-r-o-w Merge branch 'main' into cogvideox-5b-i2v
f7d8e37c
a-r-r-o-w requested a review from yiyixuxu 348 days ago
a-r-r-o-w requested a review from sayakpaul 348 days ago
a-r-r-o-w fix import
9f6f3f64
HuggingFaceDocBuilderDev commented 348 days ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sayakpaul commented on 2024-09-13

I left a few comments, but all of them are very minor in nature. Basically, this PR looks solid to me, and it shouldn't take much time to merge.

Off to @yiyixuxu.

src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py

```diff
             return self.tiled_encode(x)

         frame_batch_size = self.num_sample_frames_batch_size
+        num_batches = num_frames // frame_batch_size if num_frames > 1 else 1
         enc = []
-        for i in range(num_frames // frame_batch_size):
+        for i in range(num_batches):
```
sayakpaul 348 days ago

Better, nice!
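To spell out what the fix guards against, a standalone sketch of the arithmetic (the `frame_batch_size` value is illustrative):

```python
def num_vae_batches(num_frames: int, frame_batch_size: int) -> int:
    # With the old `range(num_frames // frame_batch_size)` loop, a single
    # conditioning image (num_frames == 1) produced zero iterations and an
    # empty `enc`; the `else 1` branch guarantees one encoding pass.
    return num_frames // frame_batch_size if num_frames > 1 else 1


assert num_vae_batches(1, 8) == 1   # I2V: a lone image frame is still encoded
assert num_vae_batches(49, 8) == 6  # multi-frame video input
```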

src/diffusers/models/transformers/cogvideox_transformer_3d.py

```diff
         hidden_states = self.proj_out(hidden_states)

         # 5. Unpatchify
+        # Note: we use `-1` instead of `channels`:
+        # - It is okay for CogVideoX-2b and CogVideoX-5b (the number of input channels equals the number of output channels)
+        # - However, for CogVideoX-5b-I2V, the input image latents are concatenated in, so the number of input channels is twice the number of output channels
```
sayakpaul 348 days ago

I think this is sufficiently supplemented with a comment; it should be fine!
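To illustrate why `-1` is safe here, a small sketch (sizes are illustrative; `p` is the patch size): the channel count is inferred from the `proj_out` output, so the reshape works whether or not the input channels equal the output channels.

```python
import torch

batch_size, num_frames, height, width, p, out_channels = 2, 4, 16, 16, 2, 16

# proj_out yields (B, F, (H/p) * (W/p), C_out * p * p); `-1` recovers C_out.
hidden_states = torch.randn(
    batch_size, num_frames, (height // p) * (width // p), out_channels * p * p
)
output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, -1, p, p)
assert output.shape[4] == out_channels  # inferred, even when in_channels == 2 * out_channels
```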

src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+        >>> pipe.to("cuda")
+
+        >>> prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
+        >>> image = load_image("astronaut.jpg")  # TODO: Add link to 720x480 image from HF Docs repo
```
sayakpaul 348 days ago

To update before merge.

src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+        extra_step_kwargs["generator"] = generator
+        return extra_step_kwargs
+
+    def check_inputs(
+        self,
+        prompt,
+        height,
+        width,
+        negative_prompt,
+        callback_on_step_end_tensor_inputs,
+        video=None,
+        latents=None,
+        prompt_embeds=None,
+        negative_prompt_embeds=None,
+    ):
```
sayakpaul 348 days ago

If the input image needs to follow any constraints, we could check for them and error out accordingly.

src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+        if video is not None and latents is not None:
+            raise ValueError("Only one of `video` or `latents` should be provided")
+
+    # Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline.fuse_qkv_projections
+    def fuse_qkv_projections(self) -> None:
+        r"""Enables fused QKV projections."""
+        self.fusing_transformer = True
+        self.transformer.fuse_qkv_projections()
+
+    # Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline.unfuse_qkv_projections
+    def unfuse_qkv_projections(self) -> None:
+        r"""Disable QKV projection fusion if enabled."""
+        if not self.fusing_transformer:
+            logger.warning("The Transformer was not initially fused for QKV projections. Doing nothing.")
+        else:
+            self.transformer.unfuse_qkv_projections()
+            self.fusing_transformer = False
```
sayakpaul 348 days ago

@yiyixuxu

I think it'd be okay to add this because @a-r-r-o-w has already verified that QKV fusion provides speed benefits when combined with quantization and torch.compile(). Since the transformer for this pipeline isn't changing, I'd expect to see similar speedups here.

But LMK if you think otherwise.
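For context, the fast path being referred to would look roughly like this (a sketch reusing `pipe`, `image`, and `prompt` from the usage example above; the fusion methods mirror the T2V pipeline they were copied from):

```python
import torch

# Fold the separate Q, K, V projections into single fused matmuls, then compile.
pipe.fuse_qkv_projections()
pipe.transformer = torch.compile(pipe.transformer)

video = pipe(image=image, prompt=prompt).frames[0]

# Optionally restore the original, unfused projections afterwards.
pipe.unfuse_qkv_projections()
```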

src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+                    image_rotary_emb=image_rotary_emb,
+                    return_dict=False,
+                )[0]
+                noise_pred = noise_pred.float()
```
sayakpaul 348 days ago

This seems interesting. Why do we have to manually perform the upcasting here?

a-r-r-o-w 348 days ago (edited 348 days ago)

I think @yiyixuxu would be better able to answer this since it was copied over from the other Cog pipelines. IIRC, the original codebase had an upcast here too, which is why we kept it.

docs/source/en/api/pipelines/cogvideox.md

```diff
   - all
   - __call__

+## CogVideoXImageToVideoPipeline
+
+[[autodoc]] CogVideoXImageToVideoPipeline
+  - all
+  - __call__
```
sayakpaul 348 days ago

If there are any restrictions on precision (e.g., fp16 shouldn't be used), we could add them to the docs.

tests/pipelines/cogvideo/test_cogvideox_image2video.py

```diff
+            "VAE tiling should not affect the inference results",
+        )
+
+    @unittest.skip("xformers attention processor does not exist for CogVideoX")
+    def test_xformers_attention_forwardGenerator_pass(self):
+        pass
```
sayakpaul 348 days ago

It should already work without this modification. We just have to set the `test_xformers_attention` attribute of `CogVideoXPipelineFastTests` to `False`.

https://github.com/huggingface/diffusers/blob/6dc6486565ea1d8d1be567eefc1094e9185560a1/tests/pipelines/test_pipelines_common.py#L1648C21-L1648C35
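i.e., something like the following (a sketch; imports and the rest of the class as in the existing test module):

```python
class CogVideoXImageToVideoPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = CogVideoXImageToVideoPipeline

    # Instead of overriding the test with @unittest.skip, flip the attribute
    # that the common xformers test consults before running.
    test_xformers_attention = False
```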

tests/pipelines/cogvideo/test_cogvideox_image2video.py

```diff
+        ), "Outputs, with QKV projection fusion enabled, shouldn't change when fused QKV projections are disabled."
+        assert np.allclose(
+            original_image_slice, image_slice_disabled, atol=1e-2, rtol=1e-2
+        ), "Original outputs should match when fused QKV projections are disabled."
```
sayakpaul 348 days ago

Would you maybe like to add a slow integration test as well, with a @unittest.skip marker on top, so that we know what kind of slices to expect?

a-r-r-o-w 348 days ago

I think it would be better to add a slow test once the model is public, in a follow-up PR, because it would fail after this is merged into main, no?

sayakpaul 348 days ago

As mentioned, if we always skip it (with the marker), it shouldn't matter but the test still remains for our convenience.
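Something along these lines (a hedged sketch of an always-skipped placeholder; the expected slice is a dummy value to be recorded once the weights are public):

```python
import unittest

import numpy as np
import torch

from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import load_image


@unittest.skip("Checkpoint not public yet; unskip and record real slices after release.")
class CogVideoXImageToVideoPipelineIntegrationTests(unittest.TestCase):
    def test_cogvideox_5b_i2v(self):
        pipe = CogVideoXImageToVideoPipeline.from_pretrained(
            "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
        ).to("cuda")
        image = load_image(
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
        )
        video = pipe(
            image=image,
            prompt="An astronaut hatching from an egg",
            num_inference_steps=2,
            output_type="np",
        ).frames[0]
        expected_slice = np.zeros(9)  # placeholder until the model is released
        assert np.allclose(video[0, -3:, -3:, -1].flatten(), expected_slice, atol=1e-2)
```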

yiyixuxu approved these changes on 2024-09-13

thanks! left some minor comments, feel free to merge once addressed!

src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+            f" size of {batch_size}. Make sure the batch size matches the length of the generators."
+        )
+
+        assert image.ndim == 4
```

yiyixuxu 348 days ago

Suggested change:

```diff
-        assert image.ndim == 4
```

It is not a method users would directly use, is it? So, we don't need an assertion here, but we can make a note about the expected shape.

src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+        assert image.ndim == 4
+        image = image.unsqueeze(2)  # [B, C, F, H, W]
+
+        if isinstance(generator, list):
+            if len(generator) != batch_size:
+                raise ValueError(
+                    f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
+                    f" size of {batch_size}. Make sure the batch size matches the length of the generators."
+                )
```

yiyixuxu 348 days ago

Suggested change:

```diff
-        if isinstance(generator, list):
-            if len(generator) != batch_size:
-                raise ValueError(
-                    f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                    f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-                )
```

I think these lines are duplicates; we did this at line 357.

src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+                latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
+
+                latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents
+                latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
```

yiyixuxu 348 days ago

interesting, they don't add noise to the image
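In isolation, the conditioning path looks like this (a sketch with illustrative shapes; `dim=2` is the channel axis in the `(B, F, C, H, W)` latent layout):

```python
import torch

do_classifier_free_guidance = True

latents = torch.randn(1, 13, 16, 60, 90)        # noisy video latents being denoised
image_latents = torch.randn(1, 13, 16, 60, 90)  # clean VAE-encoded image, padded over frames

latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents

# The image latents are never noised: they are re-concatenated, clean, at every
# step, which is why the transformer's in-channels doubled (16 -> 32).
model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
assert model_input.shape[2] == 32
```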

a-r-r-o-w update vae
877cdc0c
a-r-r-o-w update docs
29f10070
a-r-r-o-w update image link
0c1358c4
a-r-r-o-w apply suggestions from review
8222a55f
a-r-r-o-w Merge branch 'main' into cogvideox-5b-i2v
61831bd3
a-r-r-o-w commented on 2024-09-13
scripts/convert_cogvideox_to_diffusers.py

```diff
     "freqs_cos": remove_keys_inplace,
     "position_embedding": remove_keys_inplace,
+    # TODO zRzRzRzRzRzRzR: really need to remove?
+    "pos_embedding": remove_keys_inplace,
```

a-r-r-o-w 348 days ago (edited 348 days ago)

Note that CogVideoX-5b-I2V uses learned positional embeddings after the patch embedding as well as RoPE embeddings on QK.

TODO: As discussed with Yuxuan, we are able to generate videos without the learned positional embeddings, but ideally they are needed even if their role is minimal. This will limit multi-resolution or multi-frame generation, since we can't dynamically generate the learned embeddings on-the-fly. Will work together on adding this.
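The distinction, sketched (illustrative sizes): sincos embeddings are a pure function of the token grid and can be recomputed for any resolution, whereas the I2V checkpoint ships a fixed-size learned table that cannot be extended on-the-fly.

```python
import torch
import torch.nn as nn


# T2V (2B): functional sincos PE -- regenerable for any number of tokens.
def sincos_pos_embed(num_tokens: int, dim: int) -> torch.Tensor:
    position = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * torch.arange(0, dim, 2) / dim)
    pe = torch.zeros(num_tokens, dim)
    pe[:, 0::2] = torch.sin(position * freqs)
    pe[:, 1::2] = torch.cos(position * freqs)
    return pe


# I2V (5B): a learned table of fixed size, checkpointed with the model. Token
# grids beyond the training resolution have no learned rows, hence the
# multi-resolution limitation described above. (Sizes here are illustrative.)
learned_pos_embedding = nn.Parameter(torch.zeros(1, 17776, 3072))
```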

a-r-r-o-w apply suggestions from review
2d8dce9d
a-r-r-o-w add slow test
4f894269
a-r-r-o-w make use of learned positional embeddings
21a6f79b
a-r-r-o-w a-r-r-o-w requested a review from sayakpaul sayakpaul 347 days ago
sayakpaul commented on 2024-09-13
scripts/convert_cogvideox_to_diffusers.py

```diff
     "mixins.final_layer.norm_final": "norm_out.norm",
     "mixins.final_layer.linear": "proj_out",
     "mixins.final_layer.adaLN_modulation.1": "norm_out.linear",
+    "mixins.pos_embed.pos_embedding": "patch_embed.pos_embedding",  # Specific to CogVideoX-5b-I2V
```

sayakpaul 347 days ago

Should we have any if/else to guard that accordingly?

a-r-r-o-w 347 days ago

This layer is actually absent in the T2V models. It's called `position_embedding` in T2V, which is just a sincos PE, while it's `pos_embedding` here. I think it's safe, but I'm verifying it now.

a-r-r-o-w 347 days ago

Yep, this is safe and should not affect the T2V checkpoints since they follow different layer naming conventions

sayakpaul commented on 2024-09-13
src/diffusers/models/embeddings.py

```diff
+        if self.use_positional_embeddings or self.use_learned_positional_embeddings:
+            if self.use_learned_positional_embeddings and (self.sample_width != width or self.sample_height != height):
+                raise ValueError(
+                    "It is currently not possible to generate videos at a resolution different from the defaults. This should only be the case with 'THUDM/CogVideoX-5b-I2V'."
+                )
```

sayakpaul 347 days ago

In other words, the 2b variant supports it?

a-r-r-o-w 347 days ago

Yes, we had some success with multi-resolution inference quality on the 2B T2V model. The reason for allowing this is to avoid confining LoRA training to 720x480 videos on the 2B model. 5B T2V will skip this entire branch. 5B I2V uses learned positional embeddings, so we can't generate them on-the-fly like the sincos embeddings for the 2B T2V model.

sayakpaul commented on 2024-09-13
src/diffusers/models/transformers/cogvideox_transformer_3d.py

```diff
         spatial_interpolation_scale: float = 1.875,
         temporal_interpolation_scale: float = 1.0,
         use_rotary_positional_embeddings: bool = False,
+        use_learned_positional_embeddings: bool = False,
```

sayakpaul 347 days ago

Can both be true? If not, I would maybe add a check to error as early as possible.

a-r-r-o-w 347 days ago

Both are true in the case of CogVideoX-5b-I2V.

sayakpaul 347 days ago

Okay. But which combinations of values are accepted here? Not saying we should test all of them, but I think it'd be good to be aware.

a-r-r-o-w 347 days ago

The accepted combinations would be `False, False` (for 2B T2V), `True, False` (for 5B T2V), and `True, True` (for 5B I2V). Do you mean I should explicitly document this here, or add an error check for the missing case `False, True`? (I don't think it's needed, though, tbh.)

sayakpaul 347 days ago

Either should be fine but I would prefer an error.

a-r-r-o-w 347 days ago

Added an error
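Roughly (a sketch of the guard inside the transformer's constructor; the exact message wording is assumed):

```python
if not use_rotary_positional_embeddings and use_learned_positional_embeddings:
    raise ValueError(
        "There are no CogVideoX checkpoints that use learned positional embeddings "
        "without rotary embeddings. Accepted combinations are: (False, False) for "
        "2B T2V, (True, False) for 5B T2V, and (True, True) for 5B I2V."
    )
```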

sayakpaul commented on 2024-09-13
src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+        >>> image = load_image(
+        ...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
+        ... )
+        >>> video = pipe(image, prompt, use_dynamic_cfg=True)
```
sayakpaul 347 days ago

Should we let users know why to use dynamic CFG?

a-r-r-o-w 347 days ago

Ah, this is just a remnant from my script. One can generate without it as well for similar quality, but the Cog folks recommend dynamic CFG. WDYT we should do?

sayakpaul 347 days ago

Let's follow the recommendation of the authors then!

sayakpaul commented on 2024-09-13
src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

```diff
+
+                # perform guidance
+                if use_dynamic_cfg:
+                    self._guidance_scale = 1 + guidance_scale * (
+                        (1 - math.cos(math.pi * ((num_inference_steps - t.item()) / num_inference_steps) ** 5.0)) / 2
+                    )
```
sayakpaul 347 days ago

(can revisit later)

This can introduce graph breaks because we are combining non-torch operations with torch tensors. `.item()` is a data-dependent call and can also lead to performance issues.

Just noting so that we can revisit if need be.
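One possible later fix (a sketch, not part of this PR): drive the cosine schedule from the integer loop step instead of the tensor timestep, which avoids the data-dependent `.item()` sync. Note this reparameterizes the schedule from the timestep value to the step fraction, so it is not bit-identical to the current formula.

```python
import math


def dynamic_guidance_scale(guidance_scale: float, step: int, num_inference_steps: int) -> float:
    # `step` is the plain-int index from `enumerate(timesteps)`: no tensor
    # sync, no torch.compile graph break.
    progress = (step + 1) / num_inference_steps
    return 1 + guidance_scale * ((1 - math.cos(math.pi * progress**5.0)) / 2)
```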

sayakpaul commented on 2024-09-13
tests/pipelines/cogvideo/test_cogvideox_image2video.py

```diff
+
+    def get_dummy_components(self):
+        torch.manual_seed(0)
+        transformer = CogVideoXTransformer3DModel(
```
sayakpaul 347 days ago

Should we check for learned position embeddings too?

a-r-r-o-w 347 days ago

Oh, nice catch! Yes, we should have that parameter set to True for the I2V tests
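i.e., in the dummy components (a sketch; `t2v_dummy_kwargs` is a hypothetical stand-in for the tiny config already used by the T2V fast tests, minus the two flags below):

```python
torch.manual_seed(0)
transformer = CogVideoXTransformer3DModel(
    **t2v_dummy_kwargs,  # hypothetical: the existing tiny test config
    use_rotary_positional_embeddings=True,
    use_learned_positional_embeddings=True,  # exercise the I2V-specific PE path
)
```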

sayakpaul approved these changes on 2024-09-13

Looks good. My comments are minor, not blockers at all.

a-r-r-o-w apply suggestions from review
6ce07784
zRzRzRzRzRzRzR Merge branch 'huggingface:main' into cogvideox-5b-i2v
7e637d6c
zRzRzRzRzRzRzR doc change
6f313e85
zRzRzRzRzRzRzR changed the title from "Cogvideox 5b i2v draft" to "CogVideoX-5b-I2V support" 347 days ago
a-r-r-o-w Merge branch 'main' into cogvideox-5b-i2v
ed8bda96
zRzRzRzRzRzRzR Update convert_cogvideox_to_diffusers.py
c8ec68ca
a-r-r-o-w make style
33056c54
a-r-r-o-w final changes
6dc9bdb5
a-r-r-o-w commented 345 days ago

Will be merging after CI turns green. Will take up any changes in follow-up PRs.

a-r-r-o-w make style
edeb626f
a-r-r-o-w fix tests
380a820c
a-r-r-o-w merged 8336405e into main 345 days ago
tin2tin commented 344 days ago

`OSError: THUDM/CogVideoX-5b-I2V is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'`

a-r-r-o-w commented 344 days ago

The model release is planned for some time in the next few days, when the CogVideoX team is ready. Until then, we will be preparing a Diffusers patch release to ship the pipeline.

zRzRzRzRzRzRzR commented 344 days ago

> The model release is planned for some time in the next few days, when the CogVideoX team is ready. Until then, we will be preparing a Diffusers patch release to ship the pipeline.

Thank you for your support! We expect to open-source the project next week. If the patch release can be published before then, it would be a great help to us.

zRzRzRzRzRzRzR deleted the cogvideox-5b-i2v branch 225 days ago
