Spaces:

diffusers
/

optimized-diffusers-code

Running

File size: 13,428 Bytes

f4bd48c
 
 
7920feb
f4bd48c
 
 
 
 
 
 
 
ec3f4e3

---
title: "Optimized Diffusers Code"
emoji: 🔥
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
short_description: 'Optimize Diffusers Code on your hardware.'
---

Still a WIP. Use an LLM to generate reasonable code snippets in a hardware-aware manner for Diffusers.

### Motivation

Within the Diffusers, we support a bunch of optimization techniques (refer [here](https://huggingface.co/docs/diffusers/main/en/optimization/memory), [here](https://huggingface.co/docs/diffusers/main/en/optimization/cache), and [here](https://huggingface.co/docs/diffusers/main/en/optimization/fp16)). However, it can be
daunting for our users to determine when to use what. Hence, this repository tries to take a stab
at using an LLM to generate reasonable code snippets for a given pipeline checkpoint that respects
user hardware configuration.

## Getting started

Install the requirements from `requirements.txt`.

Configure `GOOGLE_API_KEY` in the environment: `export GOOGLE_API_KEY=...`.

Then run:

```bash
python e2e_example.py 
```

By default, the `e2e_example.py` script uses Flux.1-Dev, but this can be configured through the `--ckpt_id` argument.

Full usage:

```sh
usage: e2e_example.py [-h] [--ckpt_id CKPT_ID] [--gemini_model GEMINI_MODEL] [--variant VARIANT] [--enable_lossy]

options:
  -h, --help            show this help message and exit
  --ckpt_id CKPT_ID     Can be a repo id from the Hub or a local path where the checkpoint is stored.
  --gemini_model GEMINI_MODEL
                        Gemini model to use. Choose from https://ai.google.dev/gemini-api/docs/models.
  --variant VARIANT     If the `ckpt_id` has variants, supply this flag to estimate compute. Example: 'fp16'.
  --enable_lossy        When enabled, the code will include snippets for enabling quantization.
```

## Example outputs

<details>
<summary>python e2e_example.py (ran on an H100)</summary>

````sh
System RAM: 1999.99 GB
RAM Category: large

GPU VRAM: 79.65 GB
VRAM Category: large
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: False\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
from diffusers import DiffusionPipeline
import torch

# User-provided information:
# pipeline_loading_memory_GB: 31.424
# available_system_ram_GB: 1999.9855346679688 (Large RAM)
# available_gpu_vram_GB: 79.6474609375 (Large VRAM)
# enable_lossy_outputs: False
# enable_torch_compile: True

# --- Configuration based on user needs and system capabilities ---

# Placeholder for the actual checkpoint ID
# Please replace this with your desired model checkpoint ID.
CKPT_ID = "black-forest-labs/FLUX.1-dev" 

# Determine dtype. bfloat16 is generally recommended for performance on compatible GPUs.
# Ensure your GPU supports bfloat16 for optimal performance.
dtype = torch.bfloat16

# 1. Pipeline Loading and Device Placement:
# Available VRAM (79.64 GB) is significantly greater than the pipeline's loading memory (31.42 GB).
# Therefore, the entire pipeline can comfortably fit and run on the GPU.
print(f"Loading pipeline '{CKPT_ID}' with {dtype} precision...")
pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=dtype)

print("Moving pipeline to CUDA (GPU) as VRAM is sufficient...")
pipe = pipe.to("cuda")

# 2. Quantization:
# User specified `enable_lossy_outputs: False`, so no quantization is applied.
print("Quantization is NOT applied as per user's preference for lossless outputs.")

# 3. Torch Compile:
# User specified `enable_torch_compile: True`.
# Since no offloading was applied (the entire model is on GPU), we can use `fullgraph=True`
# for potentially greater performance benefits.
print("Applying torch.compile() to the transformer for accelerated inference...")
# The transformer is typically the most compute-intensive part of the diffusion pipeline.
# Compiling it can lead to significant speedups.
pipe.transformer.compile(fullgraph=True)

# --- Inference ---
print("Starting inference...")
prompt = "photo of a dog sitting beside a river, high quality, 4k"
image = pipe(prompt).images[0]

print("Inference completed. Displaying image.")
# Save or display the image
image.save("generated_image.png")
print("Image saved as generated_image.png")

# You can also display the image directly if running in an environment that supports it
# image.show()
```
````
<br>
</details>
<br>
<details>
<summary>python e2e_example.py --enable_lossy</summary>

````sh
System RAM: 1999.99 GB
RAM Category: large

GPU VRAM: 79.65 GB
VRAM Category: large
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: True\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
import os

# --- User-provided information and derived constants ---
# Checkpoint ID (assuming a placeholder since it was not provided in the user input)
# Using the example CKPT_ID from the problem description
CKPT_ID = "black-forest-labs/FLUX.1-dev"

# Derived from available_gpu_vram_GB (79.64 GB) and pipeline_loading_memory_GB (31.424 GB)
# VRAM is ample to load the entire pipeline
use_cuda_direct_load = True 

# Derived from enable_lossy_outputs (True)
enable_quantization = True

# Derived from enable_torch_compile (True)
enable_torch_compile = True

# --- Inference Code ---

print(f"Loading pipeline: {CKPT_ID}")

# 1. Quantization Configuration (since enable_lossy_outputs is True)
quant_config = None
if enable_quantization:
    # Default to bitsandbytes 4-bit as per guidance
    print("Enabling bitsandbytes 4-bit quantization for 'transformer' component.")
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit", 
        quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"},
        # For FLUX.1-dev, the main generative component is typically 'transformer'.
        # For other pipelines, you might include 'unet', 'text_encoder', 'text_encoder_2', etc.
        components_to_quantize=["transformer"] 
    )

# 2. Load the Diffusion Pipeline
# Use bfloat16 for better performance and modern GPU compatibility
pipe = DiffusionPipeline.from_pretrained(
    CKPT_ID, 
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config if enable_quantization else None
)

# 3. Move Pipeline to GPU (since VRAM is ample)
if use_cuda_direct_load:
    print("Moving the entire pipeline to CUDA (GPU).")
    pipe = pipe.to("cuda")

# 4. Apply torch.compile() (since enable_torch_compile is True)
if enable_torch_compile:
    print("Applying torch.compile() for speedup.")
    # This setting is beneficial when bitsandbytes is used
    torch._dynamo.config.capture_dynamic_output_shape_ops = True 
    
    # Since no offloading is applied (model fits fully in VRAM), use fullgraph=True
    # The primary component for compilation in FLUX.1-dev is 'transformer'
    print("Compiling pipe.transformer with fullgraph=True.")
    pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)

# 5. Perform Inference
print("Starting image generation...")
prompt = "photo of a dog sitting beside a river"
num_inference_steps = 28 # A reasonable number of steps for good quality

# Ensure all inputs are on the correct device for inference after compilation
with torch.no_grad():
    image = pipe(prompt, num_inference_steps=num_inference_steps).images[0]

print("Image generation complete.")
# Save or display the image
output_path = "generated_image.png"
image.save(output_path)
print(f"Image saved to {output_path}")

```
````

</details>
<br>
When invoked from an RTX 4090, it outputs:

<details>
<summary>Expand</summary>

````sh
System RAM: 125.54 GB
RAM Category: large

GPU VRAM: 23.99 GB
VRAM Category: medium
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 125.54026794433594\navailable_gpu_vram_GB: 23.98828125\nenable_lossy_outputs: False\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
import torch
from diffusers import DiffusionPipeline
import os # For creating offload directories if needed, though not directly used in this solution

# --- User-provided information (interpreted) ---
# Checkpoint ID will be a placeholder as it's not provided by the user directly in the input.
# pipeline_loading_memory_GB: 31.424 GB
# available_system_ram_GB: 125.54 GB (Categorized as "large": > 40GB)
# available_gpu_vram_GB: 23.98 GB (Categorized as "medium": > 8GB <= 24GB)
# enable_lossy_outputs: False (User prefers no quantization)
# enable_torch_compile: True (User wants to enable torch.compile)

# --- Configuration ---
# Placeholder for the actual checkpoint ID. Replace with the desired model ID.
CKPT_ID = "black-forest-labs/FLUX.1-dev" # Example from Diffusers library.
PROMPT = "photo of a dog sitting beside a river"

print(f"--- Optimizing inference for CKPT_ID: {CKPT_ID} ---")
print(f"Pipeline loading memory: {31.424} GB")
print(f"Available System RAM: {125.54} GB (Large)")
print(f"Available GPU VRAM: {23.98} GB (Medium)")
print(f"Lossy outputs (quantization): {'Disabled' if not False else 'Enabled'}")
print(f"Torch.compile: {'Enabled' if True else 'Disabled'}")
print("-" * 50)

# --- 1. Load the Diffusion Pipeline ---
# Use bfloat16 for a good balance of memory and performance.
print(f"Loading pipeline '{CKPT_ID}' with torch_dtype=torch.bfloat16...")
pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=torch.bfloat16)
print("Pipeline loaded.")

# --- 2. Apply Memory Optimizations ---
# Analysis:
# - Pipeline memory (31.424 GB) exceeds available GPU VRAM (23.98 GB).
# - System RAM (125.54 GB) is large.
# Strategy: Use `enable_model_cpu_offload()`. This moves model components to CPU when not
# in use, swapping them to GPU on demand. This is ideal when VRAM is insufficient but system
# RAM is abundant.

print("Applying memory optimization: `pipe.enable_model_cpu_offload()`...")
pipe.enable_model_cpu_offload()
print("Model CPU offloading enabled. Components will dynamically move between CPU and GPU.")

# --- 3. Apply Speed Optimizations (torch.compile) ---
# Analysis:
# - `enable_torch_compile` is True.
# - Model offloading (`enable_model_cpu_offload`) is applied.
# Strategy: Enable torch.compile with `recompile_limit` as offloading is used.
# Do not use `fullgraph=True` when offloading is active.

print("Applying speed optimization: `torch.compile()`...")
torch._dynamo.config.recompile_limit = 1000 # Recommended when offloading is applied.
# torch._dynamo.config.capture_dynamic_output_shape_ops = True # Only for bitsandbytes, not applicable here.

# Compile the main computational component (e.g., transformer or unet).
# FLUX models primarily use a transformer. For other models, it might be `pipe.unet`.
if hasattr(pipe, "transformer"):
    print("Compiling `pipe.transformer`...")
    pipe.transformer.compile()
elif hasattr(pipe, "unet"):
    print("Compiling `pipe.unet`...")
    pipe.unet.compile()
else:
    print("Warning: Neither `pipe.transformer` nor `pipe.unet` found for compilation. Skipping `torch.compile` for core component.")

print("Speed optimizations applied.")

# --- 4. Perform Inference ---
print(f"Starting image generation for prompt: '{PROMPT}'")
# The pipeline handles component swapping automatically due to `enable_model_cpu_offload()`.
image = pipe(PROMPT).images[0]
print("Image generation complete.")

# --- 5. Save or Display the Result ---
output_path = "generated_image.png"
image.save(output_path)
print(f"Generated image saved to '{output_path}'")

print("\n--- Inference process finished successfully ---")
```
````
</details>

### More outputs

<details>
<summary>"Wan-AI/Wan2.1-T2V-14B-Diffusers" with lossy outputs enabled</summary>

````sh
System RAM: 125.54 GB
RAM Category: large

GPU VRAM: 23.99 GB
VRAM Category: medium
("current_generate_prompt='\\nckpt_id: "
 'Wan-AI/Wan2.1-T2V-14B-Diffusers\\npipeline_loading_memory_GB: '
 '37.432\\navailable_system_ram_GB: '
 '125.54026794433594\\navailable_gpu_vram_GB: '
 '23.98828125\\nenable_lossy_outputs: True\\nis_fp8_supported: '
 "True\\nenable_torch_compile: True\\n'")
Sending request to Gemini...
```python
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
import torch

ckpt_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"

quant_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "float8dq_e4m3_row"},
    components_to_quantize=["transformer"]
)
pipe = DiffusionPipeline.from_pretrained(ckpt_id, quantization_config=quant_config, torch_dtype=torch.bfloat16)

# Apply model CPU offload due to VRAM constraints
pipe.enable_model_cpu_offload()

# torch.compile() configuration
torch._dynamo.config.recompile_limit = 1000
pipe.transformer.compile()
# pipe.vae.decode = torch.compile(pipe.vae.decode) # Uncomment if you want to compile VAE decode as well

prompt = "photo of a dog sitting beside a river"

# Modify the pipe call arguments as needed.
image = pipe(prompt).images[0]

# You can save the image or perform further operations here
# image.save("generated_image.png")
```
````
</details>
<small>Ran on an RTX 4090</small>