---
title: "Optimized Diffusers Code"
emoji: 🔥
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
short_description: 'Optimize Diffusers Code on your hardware.'
---

Still a WIP. Uses an LLM to generate reasonable, hardware-aware code snippets for Diffusers.

### Motivation

Diffusers supports a number of optimization techniques (refer [here](https://huggingface.co/docs/diffusers/main/en/optimization/memory), [here](https://huggingface.co/docs/diffusers/main/en/optimization/cache), and [here](https://huggingface.co/docs/diffusers/main/en/optimization/fp16)). However, it can be daunting for users to determine when to use what. Hence, this repository takes a stab at using an LLM to generate reasonable code snippets for a given pipeline checkpoint while respecting the user's hardware configuration.

## Getting started

Install the requirements from `requirements.txt` and configure `GOOGLE_API_KEY` in the environment: `export GOOGLE_API_KEY=...`. Then run:

```bash
python e2e_example.py
```

By default, the `e2e_example.py` script uses Flux.1-Dev, but this can be configured through the `--ckpt_id` argument. Full usage:

```sh
usage: e2e_example.py [-h] [--ckpt_id CKPT_ID] [--gemini_model GEMINI_MODEL] [--variant VARIANT] [--enable_lossy]

options:
  -h, --help            show this help message and exit
  --ckpt_id CKPT_ID     Can be a repo id from the Hub or a local path where the checkpoint is stored.
  --gemini_model GEMINI_MODEL
                        Gemini model to use. Choose from https://ai.google.dev/gemini-api/docs/models.
  --variant VARIANT     If the `ckpt_id` has variants, supply this flag to estimate compute. Example: 'fp16'.
  --enable_lossy        When enabled, the code will include snippets for enabling quantization.
```

## Example outputs
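The snippets below were produced by running `e2e_example.py` with different flag combinations. For instance, the Wan output under "More outputs" corresponds to an invocation along the following lines (the exact command isn't recorded in this README, so treat it as illustrative; `--variant` and `--gemini_model` can be supplied in the same way):

```bash
python e2e_example.py --ckpt_id "Wan-AI/Wan2.1-T2V-14B-Diffusers" --enable_lossy
```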
`python e2e_example.py` (ran on an H100):

````sh
System RAM: 1999.99 GB
RAM Category: large
GPU VRAM: 79.65 GB
VRAM Category: large
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: False\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
from diffusers import DiffusionPipeline
import torch

# User-provided information:
# pipeline_loading_memory_GB: 31.424
# available_system_ram_GB: 1999.9855346679688 (Large RAM)
# available_gpu_vram_GB: 79.6474609375 (Large VRAM)
# enable_lossy_outputs: False
# enable_torch_compile: True

# --- Configuration based on user needs and system capabilities ---

# Placeholder for the actual checkpoint ID
# Please replace this with your desired model checkpoint ID.
CKPT_ID = "black-forest-labs/FLUX.1-dev"

# Determine dtype. bfloat16 is generally recommended for performance on compatible GPUs.
# Ensure your GPU supports bfloat16 for optimal performance.
dtype = torch.bfloat16

# 1. Pipeline Loading and Device Placement:
# Available VRAM (79.64 GB) is significantly greater than the pipeline's loading memory (31.42 GB).
# Therefore, the entire pipeline can comfortably fit and run on the GPU.
print(f"Loading pipeline '{CKPT_ID}' with {dtype} precision...")
pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=dtype)

print("Moving pipeline to CUDA (GPU) as VRAM is sufficient...")
pipe = pipe.to("cuda")

# 2. Quantization:
# User specified `enable_lossy_outputs: False`, so no quantization is applied.
print("Quantization is NOT applied as per user's preference for lossless outputs.")

# 3. Torch Compile:
# User specified `enable_torch_compile: True`.
# Since no offloading was applied (the entire model is on GPU), we can use `fullgraph=True`
# for potentially greater performance benefits.
print("Applying torch.compile() to the transformer for accelerated inference...")
# The transformer is typically the most compute-intensive part of the diffusion pipeline.
# Compiling it can lead to significant speedups.
pipe.transformer.compile(fullgraph=True)

# --- Inference ---
print("Starting inference...")
prompt = "photo of a dog sitting beside a river, high quality, 4k"
image = pipe(prompt).images[0]

print("Inference completed. Displaying image.")

# Save or display the image
image.save("generated_image.png")
print("Image saved as generated_image.png")

# You can also display the image directly if running in an environment that supports it
# image.show()
```
````

`python e2e_example.py --enable_lossy`:

````sh
System RAM: 1999.99 GB
RAM Category: large
GPU VRAM: 79.65 GB
VRAM Category: large
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: True\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
import os

# --- User-provided information and derived constants ---

# Checkpoint ID (assuming a placeholder since it was not provided in the user input)
# Using the example CKPT_ID from the problem description
CKPT_ID = "black-forest-labs/FLUX.1-dev"

# Derived from available_gpu_vram_GB (79.64 GB) and pipeline_loading_memory_GB (31.424 GB)
# VRAM is ample to load the entire pipeline
use_cuda_direct_load = True

# Derived from enable_lossy_outputs (True)
enable_quantization = True

# Derived from enable_torch_compile (True)
enable_torch_compile = True

# --- Inference Code ---

print(f"Loading pipeline: {CKPT_ID}")

# 1. Quantization Configuration (since enable_lossy_outputs is True)
quant_config = None
if enable_quantization:
    # Default to bitsandbytes 4-bit as per guidance
    print("Enabling bitsandbytes 4-bit quantization for 'transformer' component.")
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"},
        # For FLUX.1-dev, the main generative component is typically 'transformer'.
        # For other pipelines, you might include 'unet', 'text_encoder', 'text_encoder_2', etc.
        components_to_quantize=["transformer"]
    )

# 2. Load the Diffusion Pipeline
# Use bfloat16 for better performance and modern GPU compatibility
pipe = DiffusionPipeline.from_pretrained(
    CKPT_ID,
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config if enable_quantization else None
)

# 3. Move Pipeline to GPU (since VRAM is ample)
if use_cuda_direct_load:
    print("Moving the entire pipeline to CUDA (GPU).")
    pipe = pipe.to("cuda")

# 4. Apply torch.compile() (since enable_torch_compile is True)
if enable_torch_compile:
    print("Applying torch.compile() for speedup.")
    # This setting is beneficial when bitsandbytes is used
    torch._dynamo.config.capture_dynamic_output_shape_ops = True
    # Since no offloading is applied (model fits fully in VRAM), use fullgraph=True
    # The primary component for compilation in FLUX.1-dev is 'transformer'
    print("Compiling pipe.transformer with fullgraph=True.")
    pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)

# 5. Perform Inference
print("Starting image generation...")
prompt = "photo of a dog sitting beside a river"
num_inference_steps = 28  # A reasonable number of steps for good quality

# Ensure all inputs are on the correct device for inference after compilation
with torch.no_grad():
    image = pipe(prompt, num_inference_steps=num_inference_steps).images[0]

print("Image generation complete.")

# Save or display the image
output_path = "generated_image.png"
image.save(output_path)
print(f"Image saved to {output_path}")
```
````

When invoked on a machine with an RTX 4090 (default flags), it outputs:
````sh
System RAM: 125.54 GB
RAM Category: large
GPU VRAM: 23.99 GB
VRAM Category: medium
current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 125.54026794433594\navailable_gpu_vram_GB: 23.98828125\nenable_lossy_outputs: False\nenable_torch_compile: True\n'
Sending request to Gemini...
```python
import torch
from diffusers import DiffusionPipeline
import os  # For creating offload directories if needed, though not directly used in this solution

# --- User-provided information (interpreted) ---
# Checkpoint ID will be a placeholder as it's not provided by the user directly in the input.
# pipeline_loading_memory_GB: 31.424 GB
# available_system_ram_GB: 125.54 GB (Categorized as "large": > 40GB)
# available_gpu_vram_GB: 23.98 GB (Categorized as "medium": > 8GB <= 24GB)
# enable_lossy_outputs: False (User prefers no quantization)
# enable_torch_compile: True (User wants to enable torch.compile)

# --- Configuration ---
# Placeholder for the actual checkpoint ID. Replace with the desired model ID.
CKPT_ID = "black-forest-labs/FLUX.1-dev"  # Example from Diffusers library.
PROMPT = "photo of a dog sitting beside a river"

print(f"--- Optimizing inference for CKPT_ID: {CKPT_ID} ---")
print(f"Pipeline loading memory: {31.424} GB")
print(f"Available System RAM: {125.54} GB (Large)")
print(f"Available GPU VRAM: {23.98} GB (Medium)")
print(f"Lossy outputs (quantization): {'Disabled' if not False else 'Enabled'}")
print(f"Torch.compile: {'Enabled' if True else 'Disabled'}")
print("-" * 50)

# --- 1. Load the Diffusion Pipeline ---
# Use bfloat16 for a good balance of memory and performance.
print(f"Loading pipeline '{CKPT_ID}' with torch_dtype=torch.bfloat16...")
pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=torch.bfloat16)
print("Pipeline loaded.")

# --- 2. Apply Memory Optimizations ---
# Analysis:
# - Pipeline memory (31.424 GB) exceeds available GPU VRAM (23.98 GB).
# - System RAM (125.54 GB) is large.
# Strategy: Use `enable_model_cpu_offload()`. This moves model components to CPU when not
# in use, swapping them to GPU on demand. This is ideal when VRAM is insufficient but system
# RAM is abundant.
print("Applying memory optimization: `pipe.enable_model_cpu_offload()`...")
pipe.enable_model_cpu_offload()
print("Model CPU offloading enabled. Components will dynamically move between CPU and GPU.")

# --- 3. Apply Speed Optimizations (torch.compile) ---
# Analysis:
# - `enable_torch_compile` is True.
# - Model offloading (`enable_model_cpu_offload`) is applied.
# Strategy: Enable torch.compile with `recompile_limit` as offloading is used.
# Do not use `fullgraph=True` when offloading is active.
print("Applying speed optimization: `torch.compile()`...")
torch._dynamo.config.recompile_limit = 1000  # Recommended when offloading is applied.
# torch._dynamo.config.capture_dynamic_output_shape_ops = True # Only for bitsandbytes, not applicable here.

# Compile the main computational component (e.g., transformer or unet).
# FLUX models primarily use a transformer. For other models, it might be `pipe.unet`.
if hasattr(pipe, "transformer"):
    print("Compiling `pipe.transformer`...")
    pipe.transformer.compile()
elif hasattr(pipe, "unet"):
    print("Compiling `pipe.unet`...")
    pipe.unet.compile()
else:
    print("Warning: Neither `pipe.transformer` nor `pipe.unet` found for compilation. Skipping `torch.compile` for core component.")

print("Speed optimizations applied.")

# --- 4. Perform Inference ---
print(f"Starting image generation for prompt: '{PROMPT}'")
# The pipeline handles component swapping automatically due to `enable_model_cpu_offload()`.
image = pipe(PROMPT).images[0]
print("Image generation complete.")

# --- 5. Save or Display the Result ---
output_path = "generated_image.png"
image.save(output_path)
print(f"Generated image saved to '{output_path}'")

print("\n--- Inference process finished successfully ---")
```
````
### More outputs
"Wan-AI/Wan2.1-T2V-14B-Diffusers" with lossy outputs enabled ````sh System RAM: 125.54 GB RAM Category: large GPU VRAM: 23.99 GB VRAM Category: medium ("current_generate_prompt='\\nckpt_id: " 'Wan-AI/Wan2.1-T2V-14B-Diffusers\\npipeline_loading_memory_GB: ' '37.432\\navailable_system_ram_GB: ' '125.54026794433594\\navailable_gpu_vram_GB: ' '23.98828125\\nenable_lossy_outputs: True\\nis_fp8_supported: ' "True\\nenable_torch_compile: True\\n'") Sending request to Gemini... ```python from diffusers import DiffusionPipeline from diffusers.quantizers import PipelineQuantizationConfig import torch ckpt_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers" quant_config = PipelineQuantizationConfig( quant_backend="torchao", quant_kwargs={"quant_type": "float8dq_e4m3_row"}, components_to_quantize=["transformer"] ) pipe = DiffusionPipeline.from_pretrained(ckpt_id, quantization_config=quant_config, torch_dtype=torch.bfloat16) # Apply model CPU offload due to VRAM constraints pipe.enable_model_cpu_offload() # torch.compile() configuration torch._dynamo.config.recompile_limit = 1000 pipe.transformer.compile() # pipe.vae.decode = torch.compile(pipe.vae.decode) # Uncomment if you want to compile VAE decode as well prompt = "photo of a dog sitting beside a river" # Modify the pipe call arguments as needed. image = pipe(prompt).images[0] # You can save the image or perform further operations here # image.save("generated_image.png") ``` ````
Ran on an RTX 4090
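
For context, the `System RAM` and `GPU VRAM` figures printed at the top of each run can be obtained with standard tooling. Below is a minimal sketch using `psutil` and `torch`; it is only an illustration and may differ from what `e2e_example.py` actually does:

```python
import psutil
import torch

# Total system RAM in GB (psutil reports bytes).
system_ram_gb = psutil.virtual_memory().total / 1024**3
print(f"System RAM: {system_ram_gb:.2f} GB")

# Total VRAM of the first visible CUDA device, if one is present.
if torch.cuda.is_available():
    gpu_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU VRAM: {gpu_vram_gb:.2f} GB")
else:
    print("No CUDA GPU detected.")
```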