Elastic model: Fastest self-serving models. mochi-1-preview.

Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you trade off model size, latency and quality by moving a single slider. For each model, ANNA produces a series of optimized models:

XL: Mathematically equivalent neural network, optimized with our DNN compiler.
L: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
M: Faster model, with accuracy degradation less than 1.5%.
S: The fastest model, with accuracy degradation less than 2%.
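
As a rough illustration of how the tiers trade quality for speed, the sketch below picks the fastest tier whose stated degradation bound fits a given budget. The `pick_mode` helper and the `NOMINAL_BOUNDS` table are hypothetical and not part of `elastic_models`. The bounds are simply the figures listed above, with XL treated as 0% because it is described as mathematically equivalent.

```python
# Hypothetical helper, not part of elastic_models: choose a tier by quality budget.
# Bounds are the nominal "less than" figures from the list above, fastest tier first.
NOMINAL_BOUNDS = [
    ("S", 2.0),
    ("M", 1.5),
    ("L", 1.0),
    ("XL", 0.0),
]


def pick_mode(max_degradation_pct: float) -> str:
    """Return the fastest tier whose nominal degradation bound fits the budget."""
    if max_degradation_pct < 0:
        raise ValueError("degradation budget must be non-negative")
    for mode, bound in NOMINAL_BOUNDS:
        if bound <= max_degradation_pct:
            return mode
    return "XL"  # not reached: the XL bound of 0.0 fits any non-negative budget
```

The returned string could then be passed as the `mode` argument of `from_pretrained`, as in the Inference example.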

Goals of Elastic Models:

Provide the fastest models and service for self-hosting.
Provide flexibility in cost vs quality selection for inference.
Provide clear quality and latency benchmarks.
Provide the interface of the HF libraries transformers and diffusers, enabled with a single line of code.
Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.

Note that the actual quality degradation varies from model to model. For instance, an S model may show as little as 0.5% degradation.

Example generation (side-by-side videos: S, XL, Original)

Prompt: Timelapse of urban cityscape transitioning from day to night
Number of frames: 100

Inference

Compiled versions are currently available only for 163-frame generations, height=480 and width=848. Other versions are not yet accessible. Stay tuned for updates!
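
Because only one shape is compiled at the moment, it can help to validate generation arguments before starting a long run. The following is a minimal sketch. `check_generation_shape` and `COMPILED_SHAPE` are hypothetical names introduced here, and the values are the ones stated above.

```python
# Hypothetical pre-flight check, not part of elastic_models.
# The only compiled configuration stated above: 163 frames at 480x848.
COMPILED_SHAPE = {"num_frames": 163, "height": 480, "width": 848}


def check_generation_shape(num_frames: int, height: int, width: int) -> list[str]:
    """Return human-readable mismatches; an empty list means the shape matches."""
    requested = {"num_frames": num_frames, "height": height, "width": width}
    return [
        f"{name}={value} (compiled: {COMPILED_SHAPE[name]})"
        for name, value in requested.items()
        if value != COMPILED_SHAPE[name]
    ]
```

An empty result means the requested arguments match the compiled configuration. Any entries name the arguments that would need to change.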

To run inference with our models, replace the diffusers import with elastic_models.diffusers:

import torch
from elastic_models.diffusers import DiffusionPipeline
from diffusers.video_processor import VideoProcessor
from diffusers.utils import export_to_video

model_name = "genmo/mochi-1-preview"
hf_token = ""  # your Hugging Face access token
device = torch.device("cuda")
dtype = torch.bfloat16

# mode selects the Elastic tier, here "S"
pipe = DiffusionPipeline.from_pretrained(
    model_name, torch_dtype=dtype, token=hf_token, mode="S"
)
pipe.enable_vae_tiling()
pipe.to(device)

prompt = "Kitten eating a banana"
with torch.no_grad():
    torch.cuda.synchronize()
    (
        prompt_embeds,
        prompt_attention_mask,
        negative_prompt_embeds,
        negative_prompt_attention_mask,
    ) = pipe.encode_prompt(prompt=prompt)
    if prompt_attention_mask is not None and isinstance(
        prompt_attention_mask, torch.Tensor
    ):
        prompt_attention_mask = prompt_attention_mask.to(dtype)

    if negative_prompt_attention_mask is not None and isinstance(
        negative_prompt_attention_mask, torch.Tensor
    ):
        negative_prompt_attention_mask = negative_prompt_attention_mask.to(dtype)

    prompt_embeds = prompt_embeds.to(dtype)
    negative_prompt_embeds = negative_prompt_embeds.to(dtype)

    with torch.autocast("cuda", torch.bfloat16, enabled=True):
        frames = pipe(
            prompt_embeds=prompt_embeds,
            prompt_attention_mask=prompt_attention_mask,
            negative_prompt_embeds=negative_prompt_embeds,
            negative_prompt_attention_mask=negative_prompt_attention_mask,
            guidance_scale=4.5,
            num_inference_steps=64,
            height=480,
            width=848,
            num_frames=163,
            generator=torch.Generator("cuda").manual_seed(0),
            output_type="latent",
            return_dict=False,
        )[0]

    video_processor = VideoProcessor(vae_scale_factor=8)
    has_latents_mean = (
        hasattr(pipe.vae.config, "latents_mean")
        and pipe.vae.config.latents_mean is not None
    )
    has_latents_std = (
        hasattr(pipe.vae.config, "latents_std")
        and pipe.vae.config.latents_std is not None
    )

    if has_latents_mean and has_latents_std:
        latents_mean = (
            torch.tensor(pipe.vae.config.latents_mean)
            .view(1, 12, 1, 1, 1)
            .to(frames.device, frames.dtype)
        )
        latents_std = (
            torch.tensor(pipe.vae.config.latents_std)
            .view(1, 12, 1, 1, 1)
            .to(frames.device, frames.dtype)
        )
        frames = frames * latents_std / pipe.vae.config.scaling_factor + latents_mean
    else:
        frames = frames / pipe.vae.config.scaling_factor

    with torch.autocast("cuda", torch.bfloat16, enabled=False):
        video = pipe.vae.decode(frames.to(pipe.vae.dtype), return_dict=False)[0]

    video = video_processor.postprocess_video(video)[0]
    torch.cuda.synchronize()
    export_to_video(video, "mochi.mp4", fps=30)

Installation

System requirements:

GPUs: H100, B200
CPU: AMD, Intel
Python: 3.10-3.12
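
A quick way to confirm the Python requirement before installing is sketched below. The `python_supported` helper is hypothetical. It only encodes the 3.10-3.12 range listed above and does not check the GPU or CPU.

```python
import sys

# Supported interpreter range from the requirements above (inclusive).
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 12)


def python_supported(version=None) -> bool:
    """Check whether a (major, minor, ...) version tuple is in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


if __name__ == "__main__":
    print("Python OK" if python_supported() else "Unsupported Python version")
```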

To work with our models just run these lines in your terminal:

pip install thestage
pip install elastic_models[nvidia]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple

# or for blackwell support
pip install elastic_models[blackwell]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple
pip install -U --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -U --pre torchvision --index-url https://download.pytorch.org/whl/nightly/cu128

pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
pip install tensorrt==10.11.0.33 opencv-python==4.11.0.86 imageio-ffmpeg==0.6.0

Then go to app.thestage.ai, log in and generate an API token from your profile page. Set up the API token as follows:

thestage config set --api-token <YOUR_API_TOKEN>

Congrats, now you can use accelerated models!

Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms.
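
To measure latency on your own hardware, one option is a small wall-clock timer like the sketch below. `time_call` is a hypothetical helper, not part of `elastic_models`, and it is not necessarily how the numbers in this section were produced. The optional `sync` argument is assumed to be something like `torch.cuda.synchronize`, so that pending GPU work is included in the measurement.

```python
import time


def time_call(fn, *args, sync=None, **kwargs):
    """Run fn once and return (result, elapsed_seconds).

    sync, if given, is called before starting and before stopping the timer
    (for example torch.cuda.synchronize) so asynchronous GPU work is counted.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `frames, seconds = time_call(generate, sync=torch.cuda.synchronize)`, where `generate` is your own function wrapping the pipeline call from the Inference section.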

Latency benchmarks

Generation time in seconds (lower is better).

Number of frames: 100

GPU	S	XL	Original
H100	144	163	311
B200	77	87	241

Number of frames: 163

GPU	S	XL	Original
H100	328	361	675
B200	173	189	545
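
The tables above can be turned into speedup factors relative to the original model. The short sketch below does that arithmetic. The `LATENCY` dictionary and `speedups` function are names introduced here for illustration, and the numbers are copied from the two tables.

```python
# Latency in seconds, copied from the tables above: (S, XL, Original).
LATENCY = {
    (100, "H100"): (144, 163, 311),
    (100, "B200"): (77, 87, 241),
    (163, "H100"): (328, 361, 675),
    (163, "B200"): (173, 189, 545),
}


def speedups(latency):
    """Map each (frames, gpu) key to the S and XL speedup over the original model."""
    return {
        key: {"S": original / s, "XL": original / xl}
        for key, (s, xl, original) in latency.items()
    }


if __name__ == "__main__":
    for (frames, gpu), ratio in speedups(LATENCY).items():
        print(f"{frames} frames on {gpu}: S {ratio['S']:.2f}x, XL {ratio['XL']:.2f}x")
```

On these numbers, S is roughly 2x faster than the original on H100 and roughly 3x faster on B200, with XL slightly behind S in each case.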
