Diffusers documentation

LTX-2

LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

You can find all the original LTX-2 checkpoints under the Lightricks organization.

The original codebase for LTX-2 can be found here.

LTX2Pipeline

class diffusers.LTX2Pipeline

( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLLTX2Video audio_vae: AutoencoderKLLTX2Audio text_encoder: Gemma3ForConditionalGeneration tokenizer: typing.Union[transformers.models.gemma.tokenization_gemma.GemmaTokenizer, transformers.models.gemma.tokenization_gemma_fast.GemmaTokenizerFast] connectors: LTX2TextConnectors transformer: LTX2VideoTransformer3DModel vocoder: LTX2Vocoder )

Parameters

  • transformer (LTX2VideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video and audio latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded latents.
  • vae (AutoencoderKLLTX2Video) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
  • audio_vae (AutoencoderKLLTX2Audio) — Variational Auto-Encoder (VAE) Model to encode and decode audio to and from latent representations.
  • text_encoder (Gemma3ForConditionalGeneration) — Text encoder used to compute prompt embeddings.
  • tokenizer (GemmaTokenizer or GemmaTokenizerFast) — Tokenizer paired with the text encoder.
  • connectors (LTX2TextConnectors) — Text connector stack used to adapt text encoder hidden states for the video and audio branches.
  • vocoder (LTX2Vocoder) — Vocoder that converts the decoded audio representation into the final audio waveform.

Pipeline for text-to-video generation.

Reference: https://github.com/Lightricks/LTX-Video

__call__

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 512 width: int = 768 num_frames: int = 121 frame_rate: float = 24.0 num_inference_steps: int = 40 timesteps: typing.List[int] = None guidance_scale: float = 4.0 guidance_rescale: float = 0.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None audio_latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None decode_timestep: typing.Union[float, typing.List[float]] = 0.0 decode_noise_scale: typing.Union[float, typing.List[float], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 1024 ) ~pipelines.ltx.LTX2PipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds instead.
  • height (int, optional, defaults to 512) — The height in pixels of the generated video.
  • width (int, optional, defaults to 768) — The width in pixels of the generated video.
  • num_frames (int, optional, defaults to 121) — The number of video frames to generate.
  • frame_rate (float, optional, defaults to 24.0) — The frames per second (FPS) of the generated video.
  • num_inference_steps (int, optional, defaults to 40) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
  • guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate outputs that are closely linked to the text prompt, usually at the expense of lower quality.
  • guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed, defined as φ in equation 16 of that paper. Guidance rescale should fix overexposure when using zero terminal SNR.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • audio_latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for audio generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
  • negative_prompt_attention_mask (torch.FloatTensor, optional) — Pre-generated attention mask for negative text embeddings.
  • decode_timestep (float, defaults to 0.0) — The timestep at which generated video is decoded.
  • decode_noise_scale (float, defaults to None) — The interpolation factor between random noise and denoised latents at the decode timestep.
  • output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.ltx.LTX2PipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional, defaults to ["latents"]) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • max_sequence_length (int, optional, defaults to 1024) — Maximum sequence length to use with the prompt.

Returns

~pipelines.ltx.LTX2PipelineOutput or tuple

If return_dict is True, ~pipelines.ltx.LTX2PipelineOutput is returned, otherwise a tuple is returned where the first element is the generated video and the second element is the generated audio.

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import LTX2Pipeline
>>> from diffusers.pipelines.ltx2.export_utils import encode_video

>>> pipe = LTX2Pipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=torch.bfloat16)
>>> pipe.enable_model_cpu_offload()

>>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> frame_rate = 24.0
>>> video, audio = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=768,
...     height=512,
...     num_frames=121,
...     frame_rate=frame_rate,
...     num_inference_steps=40,
...     guidance_scale=4.0,
...     output_type="np",
...     return_dict=False,
... )
>>> video = (video * 255).round().astype("uint8")
>>> video = torch.from_numpy(video)

>>> encode_video(
...     video[0],
...     fps=frame_rate,
...     audio=audio[0].float().cpu(),
...     audio_sample_rate=pipe.vocoder.config.output_sampling_rate,  # should be 24000
...     output_path="video.mp4",
... )
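
The callback_on_step_end argument can be used to inspect or modify tensors between denoising steps. The following is a minimal, hedged sketch (not taken from the original docs) of a callback that logs latent statistics; it assumes the usual Diffusers convention that the callback returns its callback_kwargs dictionary.

>>> def log_latent_stats(pipeline, step, timestep, callback_kwargs):
...     # Only tensors listed in callback_on_step_end_tensor_inputs are available here.
...     latents = callback_kwargs["latents"]
...     print(f"step {step}: latent std = {latents.std().item():.4f}")
...     return callback_kwargs

>>> video, audio = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     callback_on_step_end=log_latent_stats,
...     callback_on_step_end_tensor_inputs=["latents"],
...     output_type="np",
...     return_dict=False,
... )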

encode_prompt

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 1024 scale_factor: int = 8 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier free guidance or not.
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • device (torch.device, optional) — The torch device to place the resulting embeddings on.
  • dtype (torch.dtype, optional) — The torch dtype of the resulting embeddings.

Encodes the prompt into text encoder hidden states.
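
encode_prompt can be used to precompute text embeddings once and reuse them across pipeline calls via the prompt_embeds and prompt_attention_mask arguments of __call__. The sketch below is illustrative and assumes the method returns (prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask), mirroring other LTX pipelines; verify the return order against the source before relying on it.

>>> # Hedged sketch: the return order below is an assumption, not documented on this page.
>>> (
...     prompt_embeds,
...     prompt_attention_mask,
...     negative_prompt_embeds,
...     negative_prompt_attention_mask,
... ) = pipe.encode_prompt(
...     prompt="A narrow cobblestone street during a light rain",
...     negative_prompt="worst quality, blurry",
...     do_classifier_free_guidance=True,
...     max_sequence_length=1024,
... )

>>> video, audio = pipe(
...     prompt_embeds=prompt_embeds,
...     prompt_attention_mask=prompt_attention_mask,
...     negative_prompt_embeds=negative_prompt_embeds,
...     negative_prompt_attention_mask=negative_prompt_attention_mask,
...     output_type="np",
...     return_dict=False,
... )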

LTX2ImageToVideoPipeline

class diffusers.LTX2ImageToVideoPipeline

( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLLTX2Video audio_vae: AutoencoderKLLTX2Audio text_encoder: Gemma3ForConditionalGeneration tokenizer: typing.Union[transformers.models.gemma.tokenization_gemma.GemmaTokenizer, transformers.models.gemma.tokenization_gemma_fast.GemmaTokenizerFast] connectors: LTX2TextConnectors transformer: LTX2VideoTransformer3DModel vocoder: LTX2Vocoder )

Pipeline for image-to-video generation.

Reference: https://github.com/Lightricks/LTX-Video

__call__

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 512 width: int = 768 num_frames: int = 121 frame_rate: float = 24.0 num_inference_steps: int = 40 timesteps: typing.List[int] = None guidance_scale: float = 4.0 guidance_rescale: float = 0.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None audio_latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None decode_timestep: typing.Union[float, typing.List[float]] = 0.0 decode_noise_scale: typing.Union[float, typing.List[float], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 1024 ) ~pipelines.ltx.LTX2PipelineOutput or tuple

Parameters

  • image (PipelineImageInput) — The input image to condition the generation on. Must be an image, a list of images or a torch.Tensor.
  • prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds instead.
  • height (int, optional, defaults to 512) — The height in pixels of the generated video.
  • width (int, optional, defaults to 768) — The width in pixels of the generated video.
  • num_frames (int, optional, defaults to 121) — The number of video frames to generate.
  • frame_rate (float, optional, defaults to 24.0) — The frames per second (FPS) of the generated video.
  • num_inference_steps (int, optional, defaults to 40) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
  • guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate outputs that are closely linked to the text prompt, usually at the expense of lower quality.
  • guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed, defined as φ in equation 16 of that paper. Guidance rescale should fix overexposure when using zero terminal SNR.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • audio_latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for audio generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
  • negative_prompt_attention_mask (torch.FloatTensor, optional) — Pre-generated attention mask for negative text embeddings.
  • decode_timestep (float, defaults to 0.0) — The timestep at which generated video is decoded.
  • decode_noise_scale (float, defaults to None) — The interpolation factor between random noise and denoised latents at the decode timestep.
  • output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.ltx.LTX2PipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • max_sequence_length (int, optional, defaults to 1024) — Maximum sequence length to use with the prompt.

Returns

~pipelines.ltx.LTX2PipelineOutput or tuple

If return_dict is True, ~pipelines.ltx.LTX2PipelineOutput is returned, otherwise a tuple is returned where the first element is the generated video and the second element is the generated audio.

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import LTX2ImageToVideoPipeline
>>> from diffusers.pipelines.ltx2.export_utils import encode_video
>>> from diffusers.utils import load_image

>>> pipe = LTX2ImageToVideoPipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=torch.bfloat16)
>>> pipe.enable_model_cpu_offload()

>>> image = load_image(
...     "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
... )
>>> prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background."
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> frame_rate = 24.0
>>> video, audio = pipe(
...     image=image,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=768,
...     height=512,
...     num_frames=121,
...     frame_rate=frame_rate,
...     num_inference_steps=40,
...     guidance_scale=4.0,
...     output_type="np",
...     return_dict=False,
... )
>>> video = (video * 255).round().astype("uint8")
>>> video = torch.from_numpy(video)

>>> encode_video(
...     video[0],
...     fps=frame_rate,
...     audio=audio[0].float().cpu(),
...     audio_sample_rate=pipe.vocoder.config.output_sampling_rate,  # should be 24000
...     output_path="video.mp4",
... )

encode_prompt

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 1024 scale_factor: int = 8 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier free guidance or not.
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • device (torch.device, optional) — The torch device to place the resulting embeddings on.
  • dtype (torch.dtype, optional) — The torch dtype of the resulting embeddings.

Encodes the prompt into text encoder hidden states.

LTX2LatentUpsamplePipeline

class diffusers.LTX2LatentUpsamplePipeline

( vae: AutoencoderKLLTX2Video latent_upsampler: LTX2LatentUpsamplerModel )

__call__

( video: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None height: int = 512 width: int = 768 num_frames: int = 121 spatial_patch_size: int = 1 temporal_patch_size: int = 1 latents: typing.Optional[torch.Tensor] = None latents_normalized: bool = False decode_timestep: typing.Union[float, typing.List[float]] = 0.0 decode_noise_scale: typing.Union[float, typing.List[float], NoneType] = None adain_factor: float = 0.0 tone_map_compression_ratio: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) ~pipelines.ltx.LTXPipelineOutput or tuple

Parameters

  • video (List[PipelineImageInput], optional) — The video to be upsampled (such as an LTX-2 first-stage output). If not supplied, latents should be supplied.
  • height (int, optional, defaults to 512) — The height in pixels of the input video (not the generated video, which will have a larger resolution).
  • width (int, optional, defaults to 768) — The width in pixels of the input video (not the generated video, which will have a larger resolution).
  • num_frames (int, optional, defaults to 121) — The number of frames in the input video.
  • spatial_patch_size (int, optional, defaults to 1) — The spatial patch size of the video latents. Used when latents is supplied if unpacking is necessary.
  • temporal_patch_size (int, optional, defaults to 1) — The temporal patch size of the video latents. Used when latents is supplied if unpacking is necessary.
  • latents (torch.Tensor, optional) — Pre-generated video latents. This can be supplied in place of the video argument. Can either be a patch sequence of shape (batch_size, seq_len, hidden_dim) or a video latent of shape (batch_size, latent_channels, latent_frames, latent_height, latent_width).
  • latents_normalized (bool, optional, defaults to False) — If latents are supplied, whether the latents are normalized using the VAE latent mean and std. If True, the latents will be denormalized before being supplied to the latent upsampler.
  • decode_timestep (float, defaults to 0.0) — The timestep at which generated video is decoded.
  • decode_noise_scale (float, defaults to None) — The interpolation factor between random noise and denoised latents at the decode timestep.
  • adain_factor (float, optional, defaults to 0.0) — Adaptive Instance Normalization (AdaIN) blending factor between the upsampled and original latents. Should be in [-10.0, 10.0]; supplying 0.0 (the default) means that AdaIN is not performed.
  • tone_map_compression_ratio (float, optional, defaults to 0.0) — The compression strength for tone mapping, which will reduce the dynamic range of the latent values. This is useful for regularizing high-variance latents or for conditioning outputs during generation. Should be in [0, 1], where 0.0 (the default) means tone mapping is not applied and 1.0 corresponds to the full compression effect.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.ltx.LTXPipelineOutput instead of a plain tuple.

Returns

~pipelines.ltx.LTXPipelineOutput or tuple

If return_dict is True, ~pipelines.ltx.LTXPipelineOutput is returned, otherwise a tuple is returned where the first element is the upsampled video.

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import LTX2ImageToVideoPipeline, LTX2LatentUpsamplePipeline
>>> from diffusers.pipelines.ltx2.export_utils import encode_video
>>> from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
>>> from diffusers.utils import load_image

>>> pipe = LTX2ImageToVideoPipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=torch.bfloat16)
>>> pipe.enable_model_cpu_offload()

>>> image = load_image(
...     "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
... )
>>> prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background."
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> frame_rate = 24.0
>>> video, audio = pipe(
...     image=image,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=768,
...     height=512,
...     num_frames=121,
...     frame_rate=frame_rate,
...     num_inference_steps=40,
...     guidance_scale=4.0,
...     output_type="pil",
...     return_dict=False,
... )

>>> latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
...     "Lightricks/LTX-2", subfolder="latent_upsampler", torch_dtype=torch.bfloat16
... )
>>> upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
>>> upsample_pipe.vae.enable_tiling()
>>> upsample_pipe.to(device="cuda", dtype=torch.bfloat16)

>>> video = upsample_pipe(
...     video=video,
...     width=768,
...     height=512,
...     output_type="np",
...     return_dict=False,
... )[0]
>>> video = (video * 255).round().astype("uint8")
>>> video = torch.from_numpy(video)

>>> encode_video(
...     video[0],
...     fps=frame_rate,
...     audio=audio[0].float().cpu(),
...     audio_sample_rate=pipe.vocoder.config.output_sampling_rate,  # should be 24000
...     output_path="video.mp4",
... )

adain_filter_latent

( latents: Tensor reference_latents: Tensor factor: float = 1.0 ) torch.Tensor

Parameters

  • latents (torch.Tensor) — Input latents to normalize.
  • reference_latents (torch.Tensor) — The reference latents providing style statistics.
  • factor (float) — Blending factor between original and transformed latent. Range: -10.0 to 10.0, Default: 1.0

Returns

torch.Tensor

The transformed latent tensor

Applies Adaptive Instance Normalization (AdaIN) to a latent tensor based on statistics from a reference latent tensor.
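
Conceptually, AdaIN shifts and scales the input latents so their per-channel statistics match those of the reference, then blends the result with the original latents by factor. The snippet below is an illustrative sketch of that idea under common AdaIN conventions, not the exact diffusers implementation.

>>> import torch

>>> def adain_sketch(latents, reference_latents, factor=1.0):
...     # Reduce over all dimensions except batch and channel.
...     dims = list(range(2, latents.ndim))
...     mean = latents.mean(dim=dims, keepdim=True)
...     std = latents.std(dim=dims, keepdim=True)
...     ref_mean = reference_latents.mean(dim=dims, keepdim=True)
...     ref_std = reference_latents.std(dim=dims, keepdim=True)
...     # Match the reference statistics, then blend with the original latents.
...     normalized = (latents - mean) / (std + 1e-6) * ref_std + ref_mean
...     return factor * normalized + (1.0 - factor) * latents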

tone_map_latents

( latents: Tensor compression: float )

Parameters

  • latents (torch.Tensor) — Input latent tensor with arbitrary shape. Expected to be roughly in the [-1, 1] or [0, 1] range.
  • compression (float) — Compression strength in the range [0, 1].
    • 0.0: No tone-mapping (identity transform)
    • 1.0: Full compression effect

Applies a non-linear tone-mapping function to latent values to reduce their dynamic range in a perceptually smooth way using a sigmoid-based compression.

This is useful for regularizing high-variance latents or for conditioning outputs during generation, especially when controlling dynamic behavior with a compression factor.
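
As an illustration of the idea (a hedged sketch, not the exact diffusers formula), a sigmoid-shaped squashing function can be blended with the identity according to compression:

>>> import torch

>>> def tone_map_sketch(latents, compression):
...     # tanh is a rescaled sigmoid; it smoothly compresses large-magnitude values.
...     compressed = torch.tanh(latents)
...     # compression == 0.0 returns the latents unchanged; 1.0 applies the full effect.
...     return (1.0 - compression) * latents + compression * compressed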

LTX2PipelineOutput

class diffusers.pipelines.ltx2.pipeline_output.LTX2PipelineOutput

( frames: Tensor audio: Tensor )

Parameters

  • frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs - It can be a nested list of length batch_size, with each sub-list containing denoised PIL image sequences of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).
  • audio (torch.Tensor or np.ndarray) — The generated audio corresponding to the video output.

Output class for LTX pipelines.
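
When a pipeline is called with return_dict=True (the default), the output exposes these fields as attributes. A minimal sketch, assuming pipe is an LTX2Pipeline instance as in the examples above:

>>> output = pipe(prompt=prompt, output_type="np")
>>> frames = output.frames  # generated video frames
>>> audio = output.audio  # synchronized audio generated alongside the video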
