QwenImageTransformer2DModel
The model can be loaded with the following code snippet.

import torch
from diffusers import QwenImageTransformer2DModel

transformer = QwenImageTransformer2DModel.from_pretrained("Qwen/QwenImage-20B", subfolder="transformer", torch_dtype=torch.bfloat16)
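A transformer loaded this way can then be passed to the text-to-image pipeline. The sketch below is illustrative only: it assumes the repository id from the snippet above also hosts the full pipeline, and the prompt and step count are arbitrary.

```python
import torch
from diffusers import QwenImagePipeline, QwenImageTransformer2DModel

# Load the transformer on its own, for example to swap in a fine-tuned copy ...
transformer = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/QwenImage-20B", subfolder="transformer", torch_dtype=torch.bfloat16
)

# ... and pass it to the pipeline in place of the checkpoint's default transformer.
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/QwenImage-20B", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe("A cup of coffee on a wooden table", num_inference_steps=30).images[0]
image.save("qwenimage.png")
```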
QwenImageTransformer2DModel

class diffusers.QwenImageTransformer2DModel
< source >( patch_size: int = 2 in_channels: int = 64 out_channels: typing.Optional[int] = 16 num_layers: int = 60 attention_head_dim: int = 128 num_attention_heads: int = 24 joint_attention_dim: int = 3584 guidance_embeds: bool = False axes_dims_rope: typing.Tuple[int, int, int] = (16, 56, 56) zero_cond_t: bool = False use_additional_t_cond: bool = False use_layer3d_rope: bool = False )
Parameters
- patch_size (`int`, defaults to `2`) — Patch size to turn the input data into small patches.
- in_channels (`int`, defaults to `64`) — The number of channels in the input.
- out_channels (`int`, *optional*, defaults to `16`) — The number of channels in the output. If `None`, it defaults to `in_channels`.
- num_layers (`int`, defaults to `60`) — The number of layers of dual-stream DiT blocks to use.
- attention_head_dim (`int`, defaults to `128`) — The number of dimensions to use for each attention head.
- num_attention_heads (`int`, defaults to `24`) — The number of attention heads to use.
- joint_attention_dim (`int`, defaults to `3584`) — The number of dimensions to use for the joint attention (embedding/channel dimension of `encoder_hidden_states`).
- guidance_embeds (`bool`, defaults to `False`) — Whether to use guidance embeddings for the guidance-distilled variant of the model.
- axes_dims_rope (`Tuple[int]`, defaults to `(16, 56, 56)`) — The dimensions to use for the rotary positional embeddings.
The Transformer model introduced in Qwen-Image.
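The defaults above correspond to the released 20B checkpoint (60 dual-stream blocks with 24 heads of dimension 128). For quick experiments, a small randomly initialized variant can be built from the same config. This is only a sketch: the reduced values are arbitrary, and it assumes the entries of `axes_dims_rope` have to sum to `attention_head_dim`, as they do for the defaults (16 + 56 + 56 = 128).

```python
from diffusers import QwenImageTransformer2DModel

# Tiny, randomly initialized variant for CPU experiments (not the released weights).
model = QwenImageTransformer2DModel(
    patch_size=2,
    in_channels=16,
    out_channels=4,
    num_layers=2,              # released model: 60
    attention_head_dim=16,     # released model: 128
    num_attention_heads=2,     # released model: 24
    joint_attention_dim=32,    # released model: 3584
    axes_dims_rope=(4, 6, 6),  # assumed to sum to attention_head_dim
)

print(sum(p.numel() for p in model.parameters()))  # parameter count of the toy model
```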
forward
< source >( hidden_states: Tensor encoder_hidden_states: Tensor = None encoder_hidden_states_mask: Tensor = None timestep: LongTensor = None img_shapes: typing.Optional[typing.List[typing.Tuple[int, int, int]]] = None txt_seq_lens: typing.Optional[typing.List[int]] = None guidance: Tensor = None attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None controlnet_block_samples = None additional_t_cond = None return_dict: bool = True )
Parameters
- hidden_states (`torch.Tensor` of shape `(batch_size, image_sequence_length, in_channels)`) — Input `hidden_states`.
- encoder_hidden_states (`torch.Tensor` of shape `(batch_size, text_sequence_length, joint_attention_dim)`) — Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
- encoder_hidden_states_mask (`torch.Tensor` of shape `(batch_size, text_sequence_length)`, *optional*) — Mask for the encoder hidden states. Expected to have 1.0 for valid tokens and 0.0 for padding tokens. Used in the attention processor to prevent attending to padding tokens. The mask can have any pattern (not just contiguous valid tokens followed by padding) since it is applied element-wise in attention.
- timestep (`torch.LongTensor`) — Used to indicate the denoising step.
- img_shapes (`List[Tuple[int, int, int]]`, *optional*) — Image shapes for RoPE computation.
- txt_seq_lens (`List[int]`, *optional*, deprecated) — Deprecated parameter. Use `encoder_hidden_states_mask` instead. If provided, the maximum value will be used to compute the RoPE sequence length.
- guidance (`torch.Tensor`, *optional*) — Guidance tensor for conditional generation.
- attention_kwargs (`dict`, *optional*) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in `diffusers.models.attention_processor`.
- controlnet_block_samples (*optional*) — ControlNet block samples to add to the transformer blocks.
- return_dict (`bool`, *optional*, defaults to `True`) — Whether or not to return a `~models.transformer_2d.Transformer2DModelOutput` instead of a plain tuple.
The QwenImageTransformer2DModel forward method.
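Below is a minimal end-to-end call of the forward method, reusing the toy configuration from the sketch above with dummy inputs. The shapes are illustrative, and it assumes a diffusers release in which `encoder_hidden_states_mask` alone is enough to derive the text RoPE length (earlier releases also required `txt_seq_lens`).

```python
import torch
from diffusers import QwenImageTransformer2DModel

model = QwenImageTransformer2DModel(
    in_channels=16, out_channels=4, num_layers=2, attention_head_dim=16,
    num_attention_heads=2, joint_attention_dim=32, axes_dims_rope=(4, 6, 6),
)

batch, img_tokens, txt_tokens = 1, 8 * 8, 10
out = model(
    hidden_states=torch.randn(batch, img_tokens, 16),          # packed image latents
    encoder_hidden_states=torch.randn(batch, txt_tokens, 32),  # prompt embeddings
    encoder_hidden_states_mask=torch.ones(batch, txt_tokens),  # 1.0 marks valid tokens
    timestep=torch.tensor([500]),                              # denoising step
    img_shapes=[(1, 8, 8)],  # (frames, packed height, packed width); 1 * 8 * 8 = img_tokens
)
print(out.sample.shape)  # expected (1, 64, patch_size**2 * out_channels) = (1, 64, 16)
```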
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
< source >( sample: torch.Tensor )
Parameters
- sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` or `(batch_size, num_vector_embeds - 1, num_latent_pixels)` if Transformer2DModel is discrete) — The hidden states output conditioned on the `encoder_hidden_states` input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
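Transformer2DModelOutput behaves like other diffusers BaseOutput dataclasses: the prediction is reachable both as an attribute and by index. A minimal illustration with an arbitrary tensor:

```python
import torch
from diffusers.models.modeling_outputs import Transformer2DModelOutput

out = Transformer2DModelOutput(sample=torch.randn(1, 64, 64))
print(out.sample.shape)  # attribute access
print(out[0].shape)      # index access returns the same tensor
```

When the forward method is called with `return_dict=False`, the same tensor is instead returned as the first element of a plain tuple.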