---
title: "Optimized Diffusers Code"
emoji: 🔥
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
short_description: 'Optimize Diffusers Code on your hardware.'
---
Still a WIP. This project uses an LLM to generate reasonable, hardware-aware code snippets for Diffusers pipelines.
## Motivation
Diffusers supports a number of optimization techniques (see the docs on [memory](https://huggingface.co/docs/diffusers/main/en/optimization/memory), [caching](https://huggingface.co/docs/diffusers/main/en/optimization/cache), and [reduced precision](https://huggingface.co/docs/diffusers/main/en/optimization/fp16)). However, it can be
daunting for users to determine when to use which one. Hence, this repository takes a stab
at using an LLM to generate reasonable code snippets for a given pipeline checkpoint while respecting
the user's hardware configuration.
## Getting started
Install the requirements from `requirements.txt`.
Configure `GOOGLE_API_KEY` in the environment: `export GOOGLE_API_KEY=...`.
Then run:
```bash
python e2e_example.py
```
By default, the `e2e_example.py` script uses FLUX.1-dev (`black-forest-labs/FLUX.1-dev`), but this can be changed through the `--ckpt_id` argument.
Full usage:
```sh
usage: e2e_example.py [-h] [--ckpt_id CKPT_ID] [--gemini_model GEMINI_MODEL] [--variant VARIANT] [--enable_lossy]

options:
  -h, --help            show this help message and exit
  --ckpt_id CKPT_ID     Can be a repo id from the Hub or a local path where the checkpoint is stored.
  --gemini_model GEMINI_MODEL
                        Gemini model to use. Choose from https://ai.google.dev/gemini-api/docs/models.
  --variant VARIANT     If the `ckpt_id` has variants, supply this flag to estimate compute. Example: 'fp16'.
  --enable_lossy        When enabled, the code will include snippets for enabling quantization.
```
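For example, to target a different checkpoint and allow quantized (lossy) snippets, an invocation could look like the following (the Wan2.1 checkpoint below is just an illustration; any Hub repo id or local path should work for `--ckpt_id`):
```bash
# Illustrative invocation: point the script at another checkpoint and allow
# quantization snippets in the generated code.
python e2e_example.py \
  --ckpt_id "Wan-AI/Wan2.1-T2V-14B-Diffusers" \
  --enable_lossy
```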
## Example outputs
`python e2e_example.py` (run on an H100):
````sh
System RAM: 1999.99 GB
RAM Category: large
GPU VRAM: 79.65 GB
VRAM Category: large
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: False\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
from diffusers import DiffusionPipeline
import torch
# User-provided information:
# pipeline_loading_memory_GB: 31.424
# available_system_ram_GB: 1999.9855346679688 (Large RAM)
# available_gpu_vram_GB: 79.6474609375 (Large VRAM)
# enable_lossy_outputs: False
# enable_torch_compile: True
# --- Configuration based on user needs and system capabilities ---
# Placeholder for the actual checkpoint ID
# Please replace this with your desired model checkpoint ID.
CKPT_ID = "black-forest-labs/FLUX.1-dev"
# Determine dtype. bfloat16 is generally recommended for performance on compatible GPUs.
# Ensure your GPU supports bfloat16 for optimal performance.
dtype = torch.bfloat16
# 1. Pipeline Loading and Device Placement:
# Available VRAM (79.64 GB) is significantly greater than the pipeline's loading memory (31.42 GB).
# Therefore, the entire pipeline can comfortably fit and run on the GPU.
print(f"Loading pipeline '{CKPT_ID}' with {dtype} precision...")
pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=dtype)
print("Moving pipeline to CUDA (GPU) as VRAM is sufficient...")
pipe = pipe.to("cuda")
# 2. Quantization:
# User specified `enable_lossy_outputs: False`, so no quantization is applied.
print("Quantization is NOT applied as per user's preference for lossless outputs.")
# 3. Torch Compile:
# User specified `enable_torch_compile: True`.
# Since no offloading was applied (the entire model is on GPU), we can use `fullgraph=True`
# for potentially greater performance benefits.
print("Applying torch.compile() to the transformer for accelerated inference...")
# The transformer is typically the most compute-intensive part of the diffusion pipeline.
# Compiling it can lead to significant speedups.
pipe.transformer.compile(fullgraph=True)
# --- Inference ---
print("Starting inference...")
prompt = "photo of a dog sitting beside a river, high quality, 4k"
image = pipe(prompt).images[0]
print("Inference completed. Displaying image.")
# Save or display the image
image.save("generated_image.png")
print("Image saved as generated_image.png")
# You can also display the image directly if running in an environment that supports it
# image.show()
```
````
`python e2e_example.py --enable_lossy`:
````sh
System RAM: 1999.99 GB
RAM Category: large
GPU VRAM: 79.65 GB
VRAM Category: large
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: True\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
import os
# --- User-provided information and derived constants ---
# Checkpoint ID (assuming a placeholder since it was not provided in the user input)
# Using the example CKPT_ID from the problem description
CKPT_ID = "black-forest-labs/FLUX.1-dev"
# Derived from available_gpu_vram_GB (79.64 GB) and pipeline_loading_memory_GB (31.424 GB)
# VRAM is ample to load the entire pipeline
use_cuda_direct_load = True
# Derived from enable_lossy_outputs (True)
enable_quantization = True
# Derived from enable_torch_compile (True)
enable_torch_compile = True
# --- Inference Code ---
print(f"Loading pipeline: {CKPT_ID}")
# 1. Quantization Configuration (since enable_lossy_outputs is True)
quant_config = None
if enable_quantization:
    # Default to bitsandbytes 4-bit as per guidance
    print("Enabling bitsandbytes 4-bit quantization for 'transformer' component.")
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"},
        # For FLUX.1-dev, the main generative component is typically 'transformer'.
        # For other pipelines, you might include 'unet', 'text_encoder', 'text_encoder_2', etc.
        components_to_quantize=["transformer"]
    )
# 2. Load the Diffusion Pipeline
# Use bfloat16 for better performance and modern GPU compatibility
pipe = DiffusionPipeline.from_pretrained(
    CKPT_ID,
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config if enable_quantization else None
)
# 3. Move Pipeline to GPU (since VRAM is ample)
if use_cuda_direct_load:
    print("Moving the entire pipeline to CUDA (GPU).")
    pipe = pipe.to("cuda")
# 4. Apply torch.compile() (since enable_torch_compile is True)
if enable_torch_compile:
    print("Applying torch.compile() for speedup.")
    # This setting is beneficial when bitsandbytes is used
    torch._dynamo.config.capture_dynamic_output_shape_ops = True
    # Since no offloading is applied (model fits fully in VRAM), use fullgraph=True
    # The primary component for compilation in FLUX.1-dev is 'transformer'
    print("Compiling pipe.transformer with fullgraph=True.")
    pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)
# 5. Perform Inference
print("Starting image generation...")
prompt = "photo of a dog sitting beside a river"
num_inference_steps = 28 # A reasonable number of steps for good quality
# Ensure all inputs are on the correct device for inference after compilation
with torch.no_grad():
    image = pipe(prompt, num_inference_steps=num_inference_steps).images[0]
print("Image generation complete.")
# Save or display the image
output_path = "generated_image.png"
image.save(output_path)
print(f"Image saved to {output_path}")
```
````
When invoked on an RTX 4090, it outputs:
````sh
System RAM: 125.54 GB
RAM Category: large
GPU VRAM: 23.99 GB
VRAM Category: medium
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 125.54026794433594\navailable_gpu_vram_GB: 23.98828125\nenable_lossy_outputs: False\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
import torch
from diffusers import DiffusionPipeline
import os # For creating offload directories if needed, though not directly used in this solution
# --- User-provided information (interpreted) ---
# Checkpoint ID will be a placeholder as it's not provided by the user directly in the input.
# pipeline_loading_memory_GB: 31.424 GB
# available_system_ram_GB: 125.54 GB (Categorized as "large": > 40GB)
# available_gpu_vram_GB: 23.98 GB (Categorized as "medium": > 8GB <= 24GB)
# enable_lossy_outputs: False (User prefers no quantization)
# enable_torch_compile: True (User wants to enable torch.compile)
# --- Configuration ---
# Placeholder for the actual checkpoint ID. Replace with the desired model ID.
CKPT_ID = "black-forest-labs/FLUX.1-dev" # Example from Diffusers library.
PROMPT = "photo of a dog sitting beside a river"
print(f"--- Optimizing inference for CKPT_ID: {CKPT_ID} ---")
print(f"Pipeline loading memory: {31.424} GB")
print(f"Available System RAM: {125.54} GB (Large)")
print(f"Available GPU VRAM: {23.98} GB (Medium)")
print(f"Lossy outputs (quantization): {'Disabled' if not False else 'Enabled'}")
print(f"Torch.compile: {'Enabled' if True else 'Disabled'}")
print("-" * 50)
# --- 1. Load the Diffusion Pipeline ---
# Use bfloat16 for a good balance of memory and performance.
print(f"Loading pipeline '{CKPT_ID}' with torch_dtype=torch.bfloat16...")
pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=torch.bfloat16)
print("Pipeline loaded.")
# --- 2. Apply Memory Optimizations ---
# Analysis:
# - Pipeline memory (31.424 GB) exceeds available GPU VRAM (23.98 GB).
# - System RAM (125.54 GB) is large.
# Strategy: Use `enable_model_cpu_offload()`. This moves model components to CPU when not
# in use, swapping them to GPU on demand. This is ideal when VRAM is insufficient but system
# RAM is abundant.
print("Applying memory optimization: `pipe.enable_model_cpu_offload()`...")
pipe.enable_model_cpu_offload()
print("Model CPU offloading enabled. Components will dynamically move between CPU and GPU.")
# --- 3. Apply Speed Optimizations (torch.compile) ---
# Analysis:
# - `enable_torch_compile` is True.
# - Model offloading (`enable_model_cpu_offload`) is applied.
# Strategy: Enable torch.compile with `recompile_limit` as offloading is used.
# Do not use `fullgraph=True` when offloading is active.
print("Applying speed optimization: `torch.compile()`...")
torch._dynamo.config.recompile_limit = 1000 # Recommended when offloading is applied.
# torch._dynamo.config.capture_dynamic_output_shape_ops = True # Only for bitsandbytes, not applicable here.
# Compile the main computational component (e.g., transformer or unet).
# FLUX models primarily use a transformer. For other models, it might be `pipe.unet`.
if hasattr(pipe, "transformer"):
    print("Compiling `pipe.transformer`...")
    pipe.transformer.compile()
elif hasattr(pipe, "unet"):
    print("Compiling `pipe.unet`...")
    pipe.unet.compile()
else:
    print("Warning: Neither `pipe.transformer` nor `pipe.unet` found for compilation. Skipping `torch.compile` for core component.")
print("Speed optimizations applied.")
# --- 4. Perform Inference ---
print(f"Starting image generation for prompt: '{PROMPT}'")
# The pipeline handles component swapping automatically due to `enable_model_cpu_offload()`.
image = pipe(PROMPT).images[0]
print("Image generation complete.")
# --- 5. Save or Display the Result ---
output_path = "generated_image.png"
image.save(output_path)
print(f"Generated image saved to '{output_path}'")
print("\n--- Inference process finished successfully ---")
```
````
### More outputs
"Wan-AI/Wan2.1-T2V-14B-Diffusers" with lossy outputs enabled
````sh
System RAM: 125.54 GB
RAM Category: large
GPU VRAM: 23.99 GB
VRAM Category: medium
("current_generate_prompt='\\nckpt_id: "
'Wan-AI/Wan2.1-T2V-14B-Diffusers\\npipeline_loading_memory_GB: '
'37.432\\navailable_system_ram_GB: '
'125.54026794433594\\navailable_gpu_vram_GB: '
'23.98828125\\nenable_lossy_outputs: True\\nis_fp8_supported: '
"True\\nenable_torch_compile: True\\n'")
Sending request to Gemini...
```python
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
import torch
ckpt_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
quant_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "float8dq_e4m3_row"},
    components_to_quantize=["transformer"]
)
pipe = DiffusionPipeline.from_pretrained(ckpt_id, quantization_config=quant_config, torch_dtype=torch.bfloat16)
# Apply model CPU offload due to VRAM constraints
pipe.enable_model_cpu_offload()
# torch.compile() configuration
torch._dynamo.config.recompile_limit = 1000
pipe.transformer.compile()
# pipe.vae.decode = torch.compile(pipe.vae.decode) # Uncomment if you want to compile VAE decode as well
prompt = "photo of a dog sitting beside a river"
# Modify the pipe call arguments as needed.
image = pipe(prompt).images[0]
# You can save the image or perform further operations here
# image.save("generated_image.png")
```
````
Ran on an RTX 4090