Fix ErnieImagePipeline pre-computed prompt_embeds + num_images_per_prompt shape mismatch (#13532)
When a user passes pre-computed `prompt_embeds` (or `negative_prompt_embeds`)
alongside `num_images_per_prompt > 1`, `ErnieImagePipeline.__call__` did
not replicate the provided embeddings. The embeds list kept its original
length (one entry per prompt) while the latents were allocated with
`total_batch_size = batch_size * num_images_per_prompt`:

    text_hiddens = prompt_embeds  # length = batch_size (NOT replicated)
    ...
    latents = randn_tensor((total_batch_size, ...))  # batch * N in shape
In the denoise loop, `text_bth.shape[0]` then mismatches
`latent_model_input.shape[0]`, so the transformer call:

    pred = self.transformer(
        hidden_states=latent_model_input,  # (batch*N*2, ...) under CFG
        text_bth=text_bth,                 # (batch*2, ...)
        ...
    )

fails with a shape mismatch inside the attention block. The standard
"pre-compute embeds once, generate N variants" usage pattern is broken.
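
The batch arithmetic behind the mismatch can be sketched in plain Python. This is only an illustration of the shapes quoted above, not pipeline code: `batch_dims` and its parameters are hypothetical names, and the factor of 2 under CFG is taken from the comments in the snippet above.

```python
def batch_dims(batch_size, num_images_per_prompt, use_cfg, embeds_replicated):
    """Return (latent_rows, text_rows) as the transformer call would see them.

    Hypothetical helper: it only reproduces the batch-dimension arithmetic
    described in this message and does not touch the real pipeline.
    """
    cfg_factor = 2 if use_cfg else 1  # CFG doubles the batch, per the comments above
    latent_rows = batch_size * num_images_per_prompt * cfg_factor
    # Without replication, each prompt contributes a single text entry.
    text_per_prompt = num_images_per_prompt if embeds_replicated else 1
    text_rows = batch_size * text_per_prompt * cfg_factor
    return latent_rows, text_rows


# Two prompts, three images per prompt, CFG enabled.
print(batch_dims(2, 3, use_cfg=True, embeds_replicated=False))  # (12, 4)
print(batch_dims(2, 3, use_cfg=True, embeds_replicated=True))   # (12, 12)
```

With `num_images_per_prompt == 1` the two counts coincide, which is consistent with the problem appearing only when more than one image per prompt is requested.
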
`encode_prompt` already performs this replication internally
(`for _ in range(num_images_per_prompt): text_hiddens.append(hidden)`
at lines 158-160), so the non-embed path is unaffected. The bug only
affects callers of the documented `prompt_embeds` / `negative_prompt_embeds`
arguments.
Mirror the replication logic in the pre-embed branches so both paths
yield a `text_hiddens` list of length `batch_size * num_images_per_prompt`.
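
A minimal sketch of that replication, assuming the embeds arrive as a list with one entry per prompt and that each entry is repeated consecutively (matching the `encode_prompt` loop quoted above). `replicate_per_prompt` is a hypothetical name used for illustration, not the function added by this change, and the strings stand in for embedding tensors.

```python
def replicate_per_prompt(embeds, num_images_per_prompt):
    """Repeat each per-prompt entry num_images_per_prompt times, keeping prompt order.

    Hypothetical helper; list entries are reused by reference, not copied.
    """
    replicated = []
    for hidden in embeds:
        for _ in range(num_images_per_prompt):
            replicated.append(hidden)
    return replicated


# Placeholder strings stand in for per-prompt embedding tensors.
prompt_embeds = ["emb_prompt_0", "emb_prompt_1"]
text_hiddens = replicate_per_prompt(prompt_embeds, 3)
print(len(text_hiddens))  # 6
print(text_hiddens[:4])   # ['emb_prompt_0', 'emb_prompt_0', 'emb_prompt_0', 'emb_prompt_1']
```

Applying the same helper to `negative_prompt_embeds` would keep the positive and negative lists at the same length, `batch_size * num_images_per_prompt`.
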