WAN 2.2 FP8 I2V - Image-to-Video and Text-to-Video Models
High-quality text-to-video (T2V) and image-to-video (I2V) generation models in FP8 quantized format for memory-efficient deployment on consumer-grade GPUs.
Model Description
WAN 2.2 FP8 is a 14-billion parameter video generation model based on diffusion architecture, optimized with FP8 quantization for efficient deployment. This repository contains FP8 quantized variants that provide excellent quality with significantly reduced VRAM requirements compared to FP16 models (~50% memory reduction).
Key Features:
- 14B parameter diffusion-based video generation architecture
- FP8 E4M3FN quantization for memory efficiency
- Dual noise schedules (high-noise for creativity, low-noise for faithfulness)
- Support for both text-to-video and image-to-video generation
- Production-ready `.safetensors` format
Model Statistics:
- Total Repository Size: ~56GB
- Model Architecture: Diffusion transformer (14B parameters)
- Precision: FP8 E4M3FN quantization
- Format: `.safetensors` (secure tensor format)
- Input: Text prompts or text + images
- Output: Video sequences (typically 16-24 frames)
Repository Contents
Text-to-Video (T2V) Models
Located in diffusion_models/wan/
| Model | Size | Noise Schedule | Use Case |
|---|---|---|---|
| `wan22-t2v-14b-fp8-high-scaled.safetensors` | 14GB | High-noise | Creative T2V, higher variance outputs |
| `wan22-t2v-14b-fp8-low-scaled.safetensors` | 14GB | Low-noise | Faithful T2V, consistent results |
Total T2V models: 28GB
Image-to-Video (I2V) Models
Located in diffusion_models/wan/
| Model | Size | Noise Schedule | Use Case |
|---|---|---|---|
| `wan22-i2v-14b-fp8-high-scaled.safetensors` | 14GB | High-noise | Creative I2V, artistic interpretation |
| `wan22-i2v-14b-fp8-low-scaled.safetensors` | 14GB | Low-noise | Faithful I2V, accurate reproduction |
Total I2V models: 28GB
Hardware Requirements
| Model Type | Minimum VRAM | Recommended VRAM | GPU Examples |
|---|---|---|---|
| T2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
| I2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
System Requirements:
- VRAM: 16GB minimum, 20GB+ recommended
- Disk Space: 56GB for full repository (14GB per model)
- System RAM: 32GB+ recommended
- CUDA: 11.8+ or 12.1+
- PyTorch: 2.1+ with FP8 support
- diffusers: 0.20+ or compatible library
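The requirements above can be verified at runtime. A minimal sketch, assuming PyTorch and diffusers are already installed; it only reports versions, FP8 dtype availability, and VRAM, and makes no WAN-specific assumptions:

```python
import torch
import diffusers

# Report library versions against the documented minimums (PyTorch 2.1+, diffusers 0.20+)
print(f"PyTorch: {torch.__version__}, diffusers: {diffusers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}, CUDA version: {torch.version.cuda}")

# The FP8 E4M3FN dtype is exposed by PyTorch 2.1+; older builds will fail this check
print(f"FP8 E4M3FN dtype available: {hasattr(torch, 'float8_e4m3fn')}")

# Check that the GPU has at least the recommended amount of VRAM
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB (16 GB minimum, 20 GB+ recommended)")
```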
Compatible GPUs:
- NVIDIA RTX 4090 (24GB) - Excellent
- NVIDIA RTX 4080 (16GB) - Good
- NVIDIA RTX 3090 (24GB) - Excellent
- NVIDIA RTX 3090 Ti (24GB) - Excellent
- NVIDIA RTX 4070 Ti Super (16GB) - Good
- NVIDIA A5000 (24GB) - Excellent
Usage Examples
Text-to-Video Generation (FP8)
```python
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch

# Load T2V pipeline with FP8 support
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-wan22-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 T2V weights (low-noise variant for consistent results)
# and assign the result back to the pipeline (from_single_file returns a new model)
pipe.unet = pipe.unet.from_single_file(
    "E:/huggingface/wan22-fp8-i2v/diffusion_models/wan/wan22-t2v-14b-fp8-low-scaled.safetensors"
)
pipe.to("cuda")

# Generate video from text prompt
video = pipe(
    prompt="a cat walking through a garden, cinematic, high quality",
    num_inference_steps=50,
    num_frames=16,
    guidance_scale=7.5
).frames

# Save video
export_to_video(video, "output_t2v.mp4", fps=8)
```
Image-to-Video Generation (FP8)
```python
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch
from PIL import Image

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Load I2V pipeline with FP8 support
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-wan22-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 I2V weights (high-noise variant for creative output)
# and assign the result back to the pipeline
pipe.unet = pipe.unet.from_single_file(
    "E:/huggingface/wan22-fp8-i2v/diffusion_models/wan/wan22-i2v-14b-fp8-high-scaled.safetensors"
)
pipe.to("cuda")

# Generate video from image
video = pipe(
    image=input_image,
    prompt="cinematic camera movement, high quality",
    num_inference_steps=50,
    num_frames=16,
    guidance_scale=7.5
).frames

# Save video
export_to_video(video, "output_i2v.mp4", fps=8)
```
Advanced: Memory-Efficient Generation
```python
# Enable memory optimizations for 16GB GPUs
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()

# Generate with reduced memory footprint
video = pipe(
    prompt="your prompt here",
    num_inference_steps=50,
    num_frames=12,  # Reduced from 16 for memory savings
    guidance_scale=7.5
).frames
```
Model Specifications
Architecture Details
- Model Type: Diffusion transformer for video generation
- Parameters: 14 billion
- Precision: FP8 E4M3FN (8-bit floating point)
- Memory Footprint: ~14GB per model (50% reduction vs FP16)
- Format: SafeTensors (secure, efficient serialization)
Noise Schedules
High-Noise Models (*-high-scaled.safetensors):
- Greater noise variance during diffusion process
- More creative and artistic interpretation
- Higher output variance and diversity
- Best for: Abstract content, artistic videos, creative exploration
Low-Noise Models (*-low-scaled.safetensors):
- Lower noise variance during diffusion process
- More faithful to input prompts/images
- More consistent and predictable results
- Best for: Realistic content, precise control, production use
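The schedule guidance above can be captured in a small helper. A minimal sketch: the filenames match this repository, but the lookup itself is illustrative and not part of any WAN or diffusers API:

```python
# Map (task, style) to the matching checkpoint in diffusion_models/wan/
# "faithful" -> low-noise variant, "creative" -> high-noise variant
MODEL_FILES = {
    ("t2v", "creative"): "wan22-t2v-14b-fp8-high-scaled.safetensors",
    ("t2v", "faithful"): "wan22-t2v-14b-fp8-low-scaled.safetensors",
    ("i2v", "creative"): "wan22-i2v-14b-fp8-high-scaled.safetensors",
    ("i2v", "faithful"): "wan22-i2v-14b-fp8-low-scaled.safetensors",
}

def select_model(task: str, style: str = "faithful") -> str:
    """Return the checkpoint filename for a task ('t2v'/'i2v') and style ('faithful'/'creative')."""
    try:
        return MODEL_FILES[(task, style)]
    except KeyError:
        raise ValueError(f"Unknown combination: task={task!r}, style={style!r}")

print(select_model("i2v", "creative"))  # wan22-i2v-14b-fp8-high-scaled.safetensors
```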
FP8 Quantization Benefits
- Memory Efficiency: 50% smaller than FP16 (14GB vs 27GB per model)
- Speed: Faster inference on GPUs with FP8 tensor cores (RTX 40 series)
- Quality: Minimal quality degradation compared to FP16
- Accessibility: Enables deployment on 16GB consumer GPUs
- Compatibility: Works with standard diffusers pipelines
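The memory figures follow directly from the parameter count: FP8 stores one byte per parameter versus two bytes for FP16. A back-of-the-envelope check, counting weights only (quantization scales and metadata add a little on top):

```python
params = 14e9  # 14 billion parameters

fp16_gb = params * 2 / 1e9  # 2 bytes per parameter -> ~28 GB
fp8_gb = params * 1 / 1e9   # 1 byte per parameter  -> ~14 GB

print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
```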
Performance Tips
Memory Optimization
- Enable CPU Offloading: offload model components to CPU when not in use with `pipe.enable_model_cpu_offload()`
- Enable Attention Optimization: use xformers for memory-efficient attention with `pipe.enable_xformers_memory_efficient_attention()`
- Reduce Frame Count: generate fewer frames for memory savings, e.g. `num_frames=12` instead of 16
- Sequential CPU Offload: most aggressive memory savings via `pipe.enable_sequential_cpu_offload()`
Quality Optimization
Choose Appropriate Noise Schedule:
- Use low-noise models for realistic, faithful generation
- Use high-noise models for creative, artistic results
- Increase Inference Steps: more steps give better quality (50-100 recommended), e.g. `num_inference_steps=75` (higher quality, slower)
- Adjust Guidance Scale: control prompt adherence (7.5 is standard); lower is more creative, higher is more literal, e.g. `guidance_scale=7.5` (see the sweep sketch below)
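A quick sweep over guidance scales makes the trade-off easy to compare. A minimal sketch that reuses the `pipe` from the usage examples above; the prompt and output filenames are illustrative:

```python
# Compare prompt adherence at several guidance scales using the already-loaded pipeline
from diffusers.utils import export_to_video

prompt = "a cat walking through a garden, cinematic lighting, high quality"

for scale in (6.0, 7.5, 9.0):
    video = pipe(
        prompt=prompt,
        num_inference_steps=75,
        num_frames=16,
        guidance_scale=scale,
    ).frames
    export_to_video(video, f"t2v_guidance_{scale}.mp4", fps=8)
```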
Speed Optimization
- Use FP8 on RTX 40 Series: Native tensor core acceleration
- Reduce Inference Steps: faster generation with a slight quality trade-off, e.g. `num_inference_steps=30`
- Reduce Frame Count: fewer frames mean faster generation
- Enable xformers: Faster attention computation
GPU-Specific Recommendations
- RTX 40 Series (4080, 4090): Excellent FP8 performance, use native precision
- RTX 30 Series (3090, 3090 Ti): Good FP8 support, memory-efficient
- 16GB GPUs: Enable CPU offloading and xformers for best results
- 24GB GPUs: Can run without optimizations, room for larger batches
Model Selection Guide
Noise Schedule Selection
| Content Type | Recommended Model | Reason |
|---|---|---|
| Realistic videos | Low-noise | Faithful reproduction, consistency |
| Artistic/abstract | High-noise | Creative interpretation, variety |
| Product demos | Low-noise | Predictable, professional results |
| Creative exploration | High-noise | Diverse outputs, experimentation |
| Production work | Low-noise | Consistent, reliable results |
Task Selection
| Task | Models | Description |
|---|---|---|
| Text-to-Video | `wan22-t2v-*` | Generate videos from text prompts only |
| Image-to-Video | `wan22-i2v-*` | Animate static images with text guidance |
Prompting Guidelines
Effective T2V Prompts
"a cat walking through a garden, cinematic lighting, high quality, 4k"
"drone shot of mountain landscape at sunset, volumetric lighting"
"close-up of coffee being poured, slow motion, professional cinematography"
"time-lapse of city traffic at night, long exposure, urban photography"
Effective I2V Prompts
"cinematic camera movement, smooth motion"
"gentle zoom in, professional cinematography"
"dynamic action, high energy movement"
"subtle animation, natural motion"
Quality Keywords
- Cinematography: "cinematic", "professional", "high quality", "4k"
- Lighting: "volumetric lighting", "dramatic lighting", "soft light"
- Camera: "smooth motion", "stabilized", "professional camera work"
- Style: "realistic", "photorealistic", "detailed", "sharp"
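These keyword groups can also be combined programmatically when building prompts in code. A minimal sketch; the helper and its categories are illustrative, not part of any WAN tooling:

```python
# Assemble a prompt from a subject plus quality keywords from the categories above
QUALITY_KEYWORDS = {
    "cinematography": ["cinematic", "high quality", "4k"],
    "lighting": ["volumetric lighting"],
    "camera": ["smooth motion"],
}

def build_prompt(subject: str, categories=("cinematography", "lighting")) -> str:
    """Append selected keyword groups to a subject description."""
    extras = [kw for cat in categories for kw in QUALITY_KEYWORDS.get(cat, [])]
    return ", ".join([subject] + extras)

print(build_prompt("drone shot of mountain landscape at sunset"))
# drone shot of mountain landscape at sunset, cinematic, high quality, 4k, volumetric lighting
```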
Intended Uses
Direct Use
- Content Creation: Video generation for creative projects, advertising, social media
- Prototyping: Rapid visualization of video concepts and storyboards
- Research: Academic research in video generation and diffusion models
- Application Development: Building video generation features in apps and services
Downstream Use
- Fine-tuning on domain-specific video datasets
- Integration with video editing and post-production pipelines
- Custom LoRA development for specialized effects
- Synthetic data generation for training other AI models
Out-of-Scope Use
The model should NOT be used for:
- Generating deceptive, harmful, or misleading video content
- Creating deepfakes or non-consensual content of individuals
- Producing content violating copyright or intellectual property rights
- Generating content for harassment, abuse, or discrimination
- Creating videos for illegal purposes or activities
Limitations
Technical Limitations
- Temporal Consistency: May produce flickering or motion inconsistencies in long sequences
- Fine Details: Small objects or intricate textures may lack detail
- Physical Realism: Generated physics may not follow real-world rules perfectly
- Text Rendering: Cannot reliably render readable text in generated videos
- Memory Requirements: Requires 16GB+ VRAM, limiting accessibility
- Frame Count: Limited to shorter video sequences (typically 16-24 frames)
Content Limitations
- Training data biases may affect representation of diverse demographics
- May struggle with uncommon objects, rare scenarios, or niche content
- Generated content may reflect biases present in training data
- Complex motions or interactions may be challenging
Bias, Risks, and Limitations
Known Risks
Misuse Risks:
- Deepfakes: Could be used to create deceptive or misleading content
  - Mitigation: Implement watermarking and content authentication
- Copyright: May generate content similar to copyrighted material
  - Mitigation: Content filtering and responsible use policies
- Harmful Content: Could generate inappropriate content
  - Mitigation: Safety filters and content moderation
Ethical Considerations
- Obtain appropriate permissions before generating videos of identifiable individuals
- Clearly label AI-generated content to prevent deception
- Consider environmental impact of compute-intensive inference
- Respect privacy, consent, and intellectual property rights
Recommendations
- Implement content moderation and safety filters in production
- Add watermarks to identify AI-generated content
- Provide clear disclaimers for AI-generated videos
- Monitor for misuse and implement usage policies
- Validate outputs for biases or harmful content
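As a concrete example of the watermarking recommendation above, generated frames can be stamped before export. A minimal sketch using Pillow, assuming the frames are PIL images (the label text and position are arbitrary choices):

```python
from PIL import ImageDraw

def watermark_frames(frames, label="AI-generated"):
    """Draw a small text label in the corner of each PIL frame."""
    for frame in frames:
        draw = ImageDraw.Draw(frame)
        draw.text((8, frame.height - 20), label, fill=(255, 255, 255))
    return frames

# Usage with the pipelines above, before calling export_to_video:
# video = watermark_frames(video)
```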
License
This repository uses the "other" license tag. Please check the original WAN 2.2 model repository for specific license terms, usage restrictions, and commercial use permissions.
Citation
If you use WAN 2.2 FP8 in your research or applications, please cite the original model:
```bibtex
@misc{wan22-fp8,
  title={WAN 2.2 FP8: Text-to-Video and Image-to-Video Generation},
  author={WAN Team},
  year={2024},
  howpublished={\url{https://huggingface.co/wan22}},
  note={FP8 quantized variant}
}
```
Troubleshooting
Out of Memory Errors
Problem: CUDA out of memory during generation
Solutions:
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Enable sequential offload: `pipe.enable_sequential_cpu_offload()`
- Reduce frame count: `num_frames=12` (instead of 16)
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Close other GPU applications
- Reduce batch size to 1
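One way to combine these fixes is to catch the out-of-memory error and retry with the memory-saving options enabled. A minimal sketch that reuses the `pipe` from the usage examples; exact behavior depends on the pipeline:

```python
import torch

def generate_with_fallback(pipe, prompt, num_frames=16):
    """Try a normal run; on CUDA OOM, enable offloading and retry with fewer frames."""
    try:
        return pipe(prompt=prompt, num_inference_steps=50, num_frames=num_frames).frames
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        pipe.enable_model_cpu_offload()
        pipe.enable_xformers_memory_efficient_attention()
        return pipe(prompt=prompt, num_inference_steps=50, num_frames=12).frames
```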
Quality Issues
Problem: Generated videos have poor quality or artifacts
Solutions:
- Try both high-noise and low-noise variants
- Increase inference steps to 75-100
- Adjust guidance scale (try 6.0-9.0 range)
- Improve prompt quality with specific details
- Use low-noise models for more consistent results
Slow Generation
Problem: Video generation is too slow
Solutions:
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Reduce inference steps to 30-40 for testing
- Use RTX 40 series GPUs for better FP8 performance
- Reduce frame count for faster iteration
- Close background applications
Model Loading Issues
Problem: Cannot load model or incorrect format errors
Solutions:
- Verify model path is correct with absolute path
- Ensure diffusers library supports FP8 (version 0.20+)
- Check PyTorch version supports FP8 (2.1+)
- Verify CUDA version compatibility (11.8+ or 12.1+)
- Use the `from_single_file()` method for safetensors loading
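If loading still fails, the checkpoint file itself can be inspected independently of diffusers. A minimal sketch using the `safetensors` library; the path is this repository's example path, adjust as needed:

```python
import os
from safetensors import safe_open

path = "E:/huggingface/wan22-fp8-i2v/diffusion_models/wan/wan22-t2v-14b-fp8-low-scaled.safetensors"

# Confirm the file exists and reports a plausible size (~14 GB)
print(f"Exists: {os.path.exists(path)}, size: {os.path.getsize(path) / 1e9:.1f} GB")

# Read only the header to list tensor names without loading 14 GB into memory
with safe_open(path, framework="pt", device="cpu") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors; first few: {keys[:3]}")
```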
Related Resources
- WAN 2.2 Official Repository: [Link to official HuggingFace repo]
- Diffusers Documentation: https://huggingface.co/docs/diffusers
- FP8 Training Guide: [Link to FP8 documentation]
- Community Examples: [Link to community resources]
Version History
v1.0 (August 2024)
- Initial release with 4 FP8 quantized models
- 2 text-to-video models (high-noise, low-noise)
- 2 image-to-video models (high-noise, low-noise)
- Total repository size: ~56GB
Contact
For questions, issues, or contributions:
- Open an issue in the Hugging Face repository
- Refer to the original WAN 2.2 model documentation
- Check community discussions for common questions
Model Card Authors
This model card was created following Hugging Face model card guidelines and best practices for responsible AI documentation.
Last Updated: October 14, 2025
Model Version: WAN 2.2 FP8 I2V v1.0
Repository Type: Quantized Model Weights
Total Size: ~56GB (4 models × 14GB each)