WAN 2.2 FP8 I2V - Image-to-Video and Text-to-Video Models
High-quality text-to-video (T2V) and image-to-video (I2V) generation models in FP8 quantized format for memory-efficient deployment on consumer-grade GPUs.
Model Description
WAN 2.2 FP8 is a 14-billion parameter video generation model based on diffusion architecture, optimized with FP8 quantization for efficient deployment. This repository contains FP8 quantized variants that provide excellent quality with significantly reduced VRAM requirements compared to FP16 models (~50% memory reduction).
Key Features:
- 14B parameter diffusion-based video generation architecture
- FP8 E4M3FN quantization for memory efficiency
- Dual noise schedules (high-noise for creativity, low-noise for faithfulness)
- Support for both text-to-video and image-to-video generation
- Production-ready `.safetensors` format
Model Statistics:
- Total Repository Size: ~56GB
- Model Architecture: Diffusion transformer (14B parameters)
- Precision: FP8 E4M3FN quantization
- Format: `.safetensors` (secure tensor format)
- Input: Text prompts or text + images
- Output: Video sequences (typically 16-24 frames)
Repository Contents
Text-to-Video (T2V) Models
Located in diffusion_models/wan/
| Model | Size | Noise Schedule | Use Case |
|---|---|---|---|
| `wan22-t2v-14b-fp8-high-scaled.safetensors` | 14GB | High-noise | Creative T2V, higher variance outputs |
| `wan22-t2v-14b-fp8-low-scaled.safetensors` | 14GB | Low-noise | Faithful T2V, consistent results |
Total T2V models: 28GB
Image-to-Video (I2V) Models
Located in diffusion_models/wan/
| Model | Size | Noise Schedule | Use Case |
|---|---|---|---|
| `wan22-i2v-14b-fp8-high-scaled.safetensors` | 14GB | High-noise | Creative I2V, artistic interpretation |
| `wan22-i2v-14b-fp8-low-scaled.safetensors` | 14GB | Low-noise | Faithful I2V, accurate reproduction |
Total I2V models: 28GB
Hardware Requirements
| Model Type | Minimum VRAM | Recommended VRAM | GPU Examples |
|---|---|---|---|
| T2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
| I2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
System Requirements:
- VRAM: 16GB minimum, 20GB+ recommended
- Disk Space: 56GB for full repository (14GB per model)
- System RAM: 32GB+ recommended
- CUDA: 11.8+ or 12.1+
- PyTorch: 2.1+ with FP8 support
- diffusers: 0.20+ or compatible library
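The requirements above can be verified at runtime. A minimal sketch, assuming PyTorch and diffusers are already installed; it only reports versions, FP8 dtype availability, and VRAM, and makes no WAN-specific assumptions:

```python
import torch
import diffusers

# Report library versions against the documented minimums (PyTorch 2.1+, diffusers 0.20+)
print(f"PyTorch: {torch.__version__}, diffusers: {diffusers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}, CUDA version: {torch.version.cuda}")

# The FP8 E4M3FN dtype is exposed by PyTorch 2.1+; older builds will fail this check
print(f"FP8 E4M3FN dtype available: {hasattr(torch, 'float8_e4m3fn')}")

# Check that the GPU has at least the recommended amount of VRAM
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB (16 GB minimum, 20 GB+ recommended)")
```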
Compatible GPUs:
- NVIDIA RTX 4090 (24GB) - Excellent
- NVIDIA RTX 4080 (16GB) - Good
- NVIDIA RTX 3090 (24GB) - Excellent
- NVIDIA RTX 3090 Ti (24GB) - Excellent
- NVIDIA RTX 4070 Ti Super (16GB) - Good
- NVIDIA A5000 (24GB) - Excellent
Usage Examples
Text-to-Video Generation (FP8)
```python
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch

# Load T2V pipeline with FP8 support
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-wan22-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 T2V weights (low-noise variant for consistent results)
# and assign the result back to the pipeline (from_single_file returns a new model)
pipe.unet = pipe.unet.from_single_file(
    "E:/huggingface/wan22-fp8-i2v/diffusion_models/wan/wan22-t2v-14b-fp8-low-scaled.safetensors"
)
pipe.to("cuda")

# Generate video from text prompt
video = pipe(
    prompt="a cat walking through a garden, cinematic, high quality",
    num_inference_steps=50,
    num_frames=16,
    guidance_scale=7.5
).frames

# Save video
export_to_video(video, "output_t2v.mp4", fps=8)
```
Image-to-Video Generation (FP8)
```python
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch
from PIL import Image

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Load I2V pipeline with FP8 support
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-wan22-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 I2V weights (high-noise variant for creative output)
# and assign the result back to the pipeline
pipe.unet = pipe.unet.from_single_file(
    "E:/huggingface/wan22-fp8-i2v/diffusion_models/wan/wan22-i2v-14b-fp8-high-scaled.safetensors"
)
pipe.to("cuda")

# Generate video from image
video = pipe(
    image=input_image,
    prompt="cinematic camera movement, high quality",
    num_inference_steps=50,
    num_frames=16,
    guidance_scale=7.5
).frames

# Save video
export_to_video(video, "output_i2v.mp4", fps=8)
```
Advanced: Memory-Efficient Generation
```python
# Enable memory optimizations for 16GB GPUs
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()

# Generate with reduced memory footprint
video = pipe(
    prompt="your prompt here",
    num_inference_steps=50,
    num_frames=12,  # Reduced from 16 for memory savings
    guidance_scale=7.5
).frames
```
Model Specifications
Architecture Details
- Model Type: Diffusion transformer for video generation
- Parameters: 14 billion
- Precision: FP8 E4M3FN (8-bit floating point)
- Memory Footprint: ~14GB per model (50% reduction vs FP16)
- Format: SafeTensors (secure, efficient serialization)
Noise Schedules
High-Noise Models (*-high-scaled.safetensors):
- Greater noise variance during diffusion process
- More creative and artistic interpretation
- Higher output variance and diversity
- Best for: Abstract content, artistic videos, creative exploration
Low-Noise Models (*-low-scaled.safetensors):
- Lower noise variance during diffusion process
- More faithful to input prompts/images
- More consistent and predictable results
- Best for: Realistic content, precise control, production use
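The schedule guidance above can be captured in a small helper. A minimal sketch: the filenames match this repository, but the lookup itself is illustrative and not part of any WAN or diffusers API:

```python
# Map (task, style) to the matching checkpoint in diffusion_models/wan/
# "faithful" -> low-noise variant, "creative" -> high-noise variant
MODEL_FILES = {
    ("t2v", "creative"): "wan22-t2v-14b-fp8-high-scaled.safetensors",
    ("t2v", "faithful"): "wan22-t2v-14b-fp8-low-scaled.safetensors",
    ("i2v", "creative"): "wan22-i2v-14b-fp8-high-scaled.safetensors",
    ("i2v", "faithful"): "wan22-i2v-14b-fp8-low-scaled.safetensors",
}

def select_model(task: str, style: str = "faithful") -> str:
    """Return the checkpoint filename for a task ('t2v'/'i2v') and style ('faithful'/'creative')."""
    try:
        return MODEL_FILES[(task, style)]
    except KeyError:
        raise ValueError(f"Unknown combination: task={task!r}, style={style!r}")

print(select_model("i2v", "creative"))  # wan22-i2v-14b-fp8-high-scaled.safetensors
```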
FP8 Quantization Benefits
- Memory Efficiency: 50% smaller than FP16 (14GB vs 27GB per model)
- Speed: Faster inference on GPUs with FP8 tensor cores (RTX 40 series)
- Quality: Minimal quality degradation compared to FP16
- Accessibility: Enables deployment on 16GB consumer GPUs
- Compatibility: Works with standard diffusers pipelines
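The memory figures follow directly from the parameter count: FP8 stores one byte per parameter versus two bytes for FP16. A back-of-the-envelope check, counting weights only (quantization scales and metadata add a little on top):

```python
params = 14e9  # 14 billion parameters

fp16_gb = params * 2 / 1e9  # 2 bytes per parameter -> ~28 GB
fp8_gb = params * 1 / 1e9   # 1 byte per parameter  -> ~14 GB

print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
```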
Performance Tips
Memory Optimization
- Enable CPU Offloading: offload model components to CPU when not in use with `pipe.enable_model_cpu_offload()`
- Enable Attention Optimization: use xformers for memory-efficient attention with `pipe.enable_xformers_memory_efficient_attention()`
- Reduce Frame Count: generate fewer frames for memory savings, e.g. `num_frames=12` instead of 16
- Sequential CPU Offload: most aggressive memory savings via `pipe.enable_sequential_cpu_offload()`
Quality Optimization
Choose Appropriate Noise Schedule:
- Use low-noise models for realistic, faithful generation
- Use high-noise models for creative, artistic results
- Increase Inference Steps: more steps give better quality (50-100 recommended), e.g. `num_inference_steps=75` (higher quality, slower)
- Adjust Guidance Scale: control prompt adherence (7.5 is standard); lower is more creative, higher is more literal, e.g. `guidance_scale=7.5` (see the sweep sketch below)
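A quick sweep over guidance scales makes the trade-off easy to compare. A minimal sketch that reuses the `pipe` from the usage examples above; the prompt and output filenames are illustrative:

```python
# Compare prompt adherence at several guidance scales using the already-loaded pipeline
from diffusers.utils import export_to_video

prompt = "a cat walking through a garden, cinematic lighting, high quality"

for scale in (6.0, 7.5, 9.0):
    video = pipe(
        prompt=prompt,
        num_inference_steps=75,
        num_frames=16,
        guidance_scale=scale,
    ).frames
    export_to_video(video, f"t2v_guidance_{scale}.mp4", fps=8)
```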
Speed Optimization
- Use FP8 on RTX 40 Series: Native tensor core acceleration
- Reduce Inference Steps: faster generation with a slight quality trade-off, e.g. `num_inference_steps=30`
- Reduce Frame Count: fewer frames mean faster generation
- Enable xformers: Faster attention computation
GPU-Specific Recommendations
- RTX 40 Series (4080, 4090): Excellent FP8 performance, use native precision
- RTX 30 Series (3090, 3090 Ti): Good FP8 support, memory-efficient
- 16GB GPUs: Enable CPU offloading and xformers for best results
- 24GB GPUs: Can run without optimizations, room for larger batches
Model Selection Guide
Noise Schedule Selection
| Content Type | Recommended Model | Reason |
|---|---|---|
| Realistic videos | Low-noise | Faithful reproduction, consistency |
| Artistic/abstract | High-noise | Creative interpretation, variety |
| Product demos | Low-noise | Predictable, professional results |
| Creative exploration | High-noise | Diverse outputs, experimentation |
| Production work | Low-noise | Consistent, reliable results |
Task Selection
| Task | Models | Description |
|---|---|---|
| Text-to-Video | `wan22-t2v-*` | Generate videos from text prompts only |
| Image-to-Video | `wan22-i2v-*` | Animate static images with text guidance |
Prompting Guidelines
Effective T2V Prompts
"a cat walking through a garden, cinematic lighting, high quality, 4k"
"drone shot of mountain landscape at sunset, volumetric lighting"
"close-up of coffee being poured, slow motion, professional cinematography"
"time-lapse of city traffic at night, long exposure, urban photography"
Effective I2V Prompts
"cinematic camera movement, smooth motion"
"gentle zoom in, professional cinematography"
"dynamic action, high energy movement"
"subtle animation, natural motion"
Quality Keywords
- Cinematography: "cinematic", "professional", "high quality", "4k"
- Lighting: "volumetric lighting", "dramatic lighting", "soft light"
- Camera: "smooth motion", "stabilized", "professional camera work"
- Style: "realistic", "photorealistic", "detailed", "sharp"
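These keyword groups can also be combined programmatically when building prompts in code. A minimal sketch; the helper and its categories are illustrative, not part of any WAN tooling:

```python
# Assemble a prompt from a subject plus quality keywords from the categories above
QUALITY_KEYWORDS = {
    "cinematography": ["cinematic", "high quality", "4k"],
    "lighting": ["volumetric lighting"],
    "camera": ["smooth motion"],
}

def build_prompt(subject: str, categories=("cinematography", "lighting")) -> str:
    """Append selected keyword groups to a subject description."""
    extras = [kw for cat in categories for kw in QUALITY_KEYWORDS.get(cat, [])]
    return ", ".join([subject] + extras)

print(build_prompt("drone shot of mountain landscape at sunset"))
# drone shot of mountain landscape at sunset, cinematic, high quality, 4k, volumetric lighting
```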
Intended Uses
Direct Use
- Content Creation: Video generation for creative projects, advertising, social media
- Prototyping: Rapid visualization of video concepts and storyboards
- Research: Academic research in video generation and diffusion models
- Application Development: Building video generation features in apps and services
Downstream Use
- Fine-tuning on domain-specific video datasets
- Integration with video editing and post-production pipelines
- Custom LoRA development for specialized effects
- Synthetic data generation for training other AI models
Out-of-Scope Use
The model should NOT be used for:
- Generating deceptive, harmful, or misleading video content
- Creating deepfakes or non-consensual content of individuals
- Producing content violating copyright or intellectual property rights
- Generating content for harassment, abuse, or discrimination
- Creating videos for illegal purposes or activities
Limitations
Technical Limitations
- Temporal Consistency: May produce flickering or motion inconsistencies in long sequences
- Fine Details: Small objects or intricate textures may lack detail
- Physical Realism: Generated physics may not follow real-world rules perfectly
- Text Rendering: Cannot reliably render readable text in generated videos
- Memory Requirements: Requires 16GB+ VRAM, limiting accessibility
- Frame Count: Limited to shorter video sequences (typically 16-24 frames)
Content Limitations
- Training data biases may affect representation of diverse demographics
- May struggle with uncommon objects, rare scenarios, or niche content
- Generated content may reflect biases present in training data
- Complex motions or interactions may be challenging
Bias, Risks, and Limitations
Known Risks
Misuse Risks:
- Deepfakes: Could be used to create deceptive or misleading content
  - Mitigation: Implement watermarking and content authentication
- Copyright: May generate content similar to copyrighted material
  - Mitigation: Content filtering and responsible use policies
- Harmful Content: Could generate inappropriate content
  - Mitigation: Safety filters and content moderation
Ethical Considerations
- Obtain appropriate permissions before generating videos of identifiable individuals
- Clearly label AI-generated content to prevent deception
- Consider environmental impact of compute-intensive inference
- Respect privacy, consent, and intellectual property rights
Recommendations
- Implement content moderation and safety filters in production
- Add watermarks to identify AI-generated content
- Provide clear disclaimers for AI-generated videos
- Monitor for misuse and implement usage policies
- Validate outputs for biases or harmful content
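As a concrete example of the watermarking recommendation above, generated frames can be stamped before export. A minimal sketch using Pillow, assuming the frames are PIL images (the label text and position are arbitrary choices):

```python
from PIL import ImageDraw

def watermark_frames(frames, label="AI-generated"):
    """Draw a small text label in the corner of each PIL frame."""
    for frame in frames:
        draw = ImageDraw.Draw(frame)
        draw.text((8, frame.height - 20), label, fill=(255, 255, 255))
    return frames

# Usage with the pipelines above, before calling export_to_video:
# video = watermark_frames(video)
```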
License
This repository uses the "other" license tag. Please check the original WAN 2.2 model repository for specific license terms, usage restrictions, and commercial use permissions.
Citation
If you use WAN 2.2 FP8 in your research or applications, please cite the original model:
```bibtex
@misc{wan22-fp8,
  title={WAN 2.2 FP8: Text-to-Video and Image-to-Video Generation},
  author={WAN Team},
  year={2024},
  howpublished={\url{https://huggingface.co/wan22}},
  note={FP8 quantized variant}
}
```
Troubleshooting
Out of Memory Errors
Problem: CUDA out of memory during generation
Solutions:
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Enable sequential offload: `pipe.enable_sequential_cpu_offload()`
- Reduce frame count: `num_frames=12` (instead of 16)
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Close other GPU applications
- Reduce batch size to 1
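One way to combine these fixes is to catch the out-of-memory error and retry with the memory-saving options enabled. A minimal sketch that reuses the `pipe` from the usage examples; exact behavior depends on the pipeline:

```python
import torch

def generate_with_fallback(pipe, prompt, num_frames=16):
    """Try a normal run; on CUDA OOM, enable offloading and retry with fewer frames."""
    try:
        return pipe(prompt=prompt, num_inference_steps=50, num_frames=num_frames).frames
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        pipe.enable_model_cpu_offload()
        pipe.enable_xformers_memory_efficient_attention()
        return pipe(prompt=prompt, num_inference_steps=50, num_frames=12).frames
```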
Quality Issues
Problem: Generated videos have poor quality or artifacts
Solutions:
- Try both high-noise and low-noise variants
- Increase inference steps to 75-100
- Adjust guidance scale (try 6.0-9.0 range)
- Improve prompt quality with specific details
- Use low-noise models for more consistent results
Slow Generation
Problem: Video generation is too slow
Solutions:
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Reduce inference steps to 30-40 for testing
- Use RTX 40 series GPUs for better FP8 performance
- Reduce frame count for faster iteration
- Close background applications
Model Loading Issues
Problem: Cannot load model or incorrect format errors
Solutions:
- Verify model path is correct with absolute path
- Ensure diffusers library supports FP8 (version 0.20+)
- Check PyTorch version supports FP8 (2.1+)
- Verify CUDA version compatibility (11.8+ or 12.1+)
- Use the `from_single_file()` method for safetensors loading
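If loading still fails, the checkpoint file itself can be inspected independently of diffusers. A minimal sketch using the `safetensors` library; the path is this repository's example path, adjust as needed:

```python
import os
from safetensors import safe_open

path = "E:/huggingface/wan22-fp8-i2v/diffusion_models/wan/wan22-t2v-14b-fp8-low-scaled.safetensors"

# Confirm the file exists and reports a plausible size (~14 GB)
print(f"Exists: {os.path.exists(path)}, size: {os.path.getsize(path) / 1e9:.1f} GB")

# Read only the header to list tensor names without loading 14 GB into memory
with safe_open(path, framework="pt", device="cpu") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors; first few: {keys[:3]}")
```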
Related Resources
- WAN 2.2 Official Repository: [Link to official HuggingFace repo]
- Diffusers Documentation: https://huggingface.co/docs/diffusers
- FP8 Training Guide: [Link to FP8 documentation]
- Community Examples: [Link to community resources]
Version History
v1.0 (August 2024)
- Initial release with 4 FP8 quantized models
- 2 text-to-video models (high-noise, low-noise)
- 2 image-to-video models (high-noise, low-noise)
- Total repository size: ~56GB
Contact
For questions, issues, or contributions:
- Open an issue in the Hugging Face repository
- Refer to the original WAN 2.2 model documentation
- Check community discussions for common questions
Model Card Authors
This model card was created following Hugging Face model card guidelines and best practices for responsible AI documentation.
Last Updated: October 14, 2025
Model Version: WAN 2.2 FP8 I2V v1.0
Repository Type: Quantized Model Weights
Total Size: ~56GB (4 models × 14GB each)