diffusers
0fff459d - Fix ErnieImagePipeline pre-computed prompt_embeds + num_images_per_prompt shape mismatch (#13532)

Commit
66 days ago
Fix ErnieImagePipeline pre-computed prompt_embeds + num_images_per_prompt shape mismatch (#13532) Fix ErnieImagePipeline pre-computed prompt_embeds + num_images_per_prompt When a user passes pre-computed `prompt_embeds` (or `negative_prompt_embeds`) alongside `num_images_per_prompt > 1`, `ErnieImagePipeline.__call__` did not replicate the provided embeddings — the embeds list kept its original length (one per prompt) while the latents were allocated with `total_batch_size = batch_size * num_images_per_prompt`: text_hiddens = prompt_embeds # length = batch_size (NOT replicated) ... latents = randn_tensor((total_batch_size, ...)) # batch * N in shape In the denoise loop `text_bth.shape[0]` then mismatches `latent_model_input.shape[0]`, so the transformer call: pred = self.transformer( hidden_states=latent_model_input, # (batch*N*2, ...) under CFG text_bth=text_bth, # (batch*2, ...) ... ) fails with a shape mismatch inside the attention block. The standard "pre-compute embeds once, generate N variants" usage pattern is broken. `encode_prompt` already performs this replication internally (`for _ in range(num_images_per_prompt): text_hiddens.append(hidden)` at lines 158-160), so the non-embed path is unaffected — this only impacts callers of the documented `prompt_embeds` / `negative_prompt_embeds` arguments. Mirror the replication logic in the pre-embed branches so both paths yield a `text_hiddens` list of length `batch_size * num_images_per_prompt`.
Author
Parents
Loading