Token size limit

#22
by gebaltso - opened

Hello, I would like to ask what the prompt token limit is in SD3. Is it 2 x 77, or did I misunderstand? Thanks in advance.

For now it is 77, and this applies to all three text encoders. There's a PR that raises the limit for the T5 encoder only, up to 512 tokens, but for the CLIP encoders it will still be 77.

Hi @gebaltso, do you mean 77 for the prompt + 77 for the negative prompt?
According to the code, yes, it should be 77, but on my side it truncates after 75 and I don't know why.

There are 75 real tokens; the other two are the BOS and EOS tokens. Also, the 2 x 77 means that each CLIP model uses 77 tokens, and since there are two of them, that gives 2 x 77.
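To make the 75 + 2 arithmetic concrete, here is a minimal sketch of how one CLIP window gets filled. It is not the diffusers or transformers implementation: `clip_window` is a hypothetical helper, and the BOS/EOS ids (49406/49407) and the choice of padding id are assumptions that should be checked against `tokenizer.bos_token_id`, `tokenizer.eos_token_id` and `tokenizer.pad_token_id` on the actual model.

```python
# Assumed CLIP special-token ids; verify them on your own tokenizer.
BOS_ID = 49406
EOS_ID = 49407


def clip_window(token_ids, max_length=77, pad_id=EOS_ID):
    """Fit raw prompt token ids into a single CLIP window.

    Returns the padded window and the list of token ids that were cut off.
    """
    usable = max_length - 2            # 77 slots minus BOS and EOS leaves 75
    kept = token_ids[:usable]
    dropped = token_ids[usable:]
    window = [BOS_ID] + kept + [EOS_ID]
    # Pad short prompts up to the fixed window length.
    window += [pad_id] * (max_length - len(window))
    return window, dropped


# 80 made-up token ids: 75 fit, 5 are truncated.
window, dropped = clip_window(list(range(1000, 1080)))
print(len(window), len(dropped))  # prints "77 5"
```

Each of the two CLIP encoders gets its own window like this, which is where the 2 x 77 comes from.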

Because the example prompts have more than 77 tokens, I previously modified diffusers to support long T5 prompts of up to 512 tokens.
But unfortunately this space is rarely used by anyone 😂 mood.
https://huggingface.co/spaces/vilarin/sd3m-long

This almost works:

from compel import Compel, ReturnedEmbeddingsType

# `pipeline` is an already-loaded pipeline; only the two CLIP encoders are passed to Compel.
compel = Compel(
    truncate_long_prompts=False,
    tokenizer=[
        pipeline.tokenizer,
        pipeline.tokenizer_2,
    ],
    text_encoder=[
        pipeline.text_encoder,
        pipeline.text_encoder_2,
    ],
    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
    requires_pooled=[
        False,
        True,
    ],
)

conditioning, pooled = compel(prompt)
negative_embed, negative_pooled = compel(negative_prompt)
# Pad both embeddings to the same sequence length before passing them to the pipeline.
[conditioning, negative_embed] = compel.pad_conditioning_tensors_to_same_length(
    [conditioning, negative_embed]
)

# The call returns the generated images, so the result is named accordingly.
images = pipeline(
    output_type='pil',
    num_inference_steps=num_inference_steps,
    num_images_per_prompt=num_images_per_prompt,
    width=512,
    height=512,
    prompt_embeds=conditioning,
    pooled_prompt_embeds=pooled,
    negative_prompt_embeds=negative_embed,
    negative_pooled_prompt_embeds=negative_pooled,
).images

How can I use the T5? Is there an example on how to do that?

Edit: using both prompt and prompt_3 (T5):

image = pipe(
    prompt=prompt,
    prompt_3=prompt_3,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]

The documentation for this is still only on the main branch, so until the next release, this is the link:

https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_3#using-long-prompts-with-the-t5-text-encoder

If you want to use it with low VRAM there's documentation about it too.
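
As a rough sketch of how the call above could be wrapped, the helper below builds the keyword arguments with a short prompt for the CLIP encoders and a long one for T5. `build_sd3_call_kwargs` and `T5_MAX_TOKENS` are hypothetical names made up for this example, not part of diffusers, and the 512 bound is simply the figure quoted in this thread.

```python
T5_MAX_TOKENS = 512  # upper bound for the T5 encoder quoted in this thread


def build_sd3_call_kwargs(clip_prompt, t5_prompt, max_sequence_length=512,
                          negative_prompt=""):
    """Collect arguments for an SD3 pipeline call (hypothetical helper).

    clip_prompt goes to the CLIP encoders (77-token windows),
    t5_prompt goes to the T5 encoder via prompt_3.
    """
    if not 1 <= max_sequence_length <= T5_MAX_TOKENS:
        raise ValueError(
            f"max_sequence_length must be between 1 and {T5_MAX_TOKENS}, "
            f"got {max_sequence_length}"
        )
    return {
        "prompt": clip_prompt,
        "prompt_3": t5_prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": 28,
        "guidance_scale": 4.5,
        "max_sequence_length": max_sequence_length,
    }


kwargs = build_sd3_call_kwargs(
    "a cat, studio photo",
    "A long, detailed description of the scene that would not fit in 77 tokens",
)
# With an already-loaded SD3 pipeline named `pipe` (assumed, not created here):
# image = pipe(**kwargs).images[0]
```

Checking the length before the call just surfaces a bad value early; the pipeline itself may apply its own validation.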
