Qwen3-VL-4B-Thinking
Qwen3-VL-4B-Thinking is a 4-billion-parameter multimodal vision-language model developed by the Qwen team at Alibaba Cloud. This "Thinking" variant emphasizes deeper multi-step reasoning, analysis, and planning with detailed chain-of-thought generation, making it ideal for advanced visual reasoning tasks across images, text, and video inputs.
Model Description
Qwen3-VL-4B-Thinking represents the latest advancement in the Qwen vision-language model series, designed to deliver superior text understanding, deeper visual perception, and extended reasoning capabilities. The model excels at producing structured outputs that expose intermediate reasoning steps, making it particularly valuable for research applications, multimodal understanding tasks, and agentic workflows.
Key Capabilities
- Advanced Visual Reasoning: Multi-step chain-of-thought generation for complex visual understanding
- Extended OCR Support: 32 languages with robust performance in low light, blur, and tilt conditions; improved handling of rare/ancient characters and specialized jargon
- Text-Vision Fusion: Text understanding capabilities on par with pure language models
- Visual Agent Functions: Operates PC and mobile GUIs for interactive applications
- Visual Coding: Generates Draw.io diagrams, HTML/CSS/JavaScript from images and videos
- Spatial Perception: Advanced 2D and 3D grounding with object positioning (see the example prompt after this list)
- Video Understanding: Processes videos exceeding 1 hour with temporal event localization
- Extended Context: Handles up to 256K tokens (expandable to 1M tokens)
- Structured Output: Document analysis for invoices, forms, tables, and charts
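As an illustration of the spatial-perception capability, a grounding request is just an ordinary chat message. The sketch below is a hedged example: the field names requested in the prompt (such as "bbox_2d") follow the convention of earlier Qwen-VL releases and should be verified against the official model card. Plug the messages payload into the inference pipeline shown under Usage Examples.

from PIL import Image

# Hedged example: message payload for a 2D grounding request
image = Image.open("street_scene.jpg")  # hypothetical example image
grounding_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Locate every car in this image and return a JSON list of objects with 'label' and 'bbox_2d' fields."}
        ]
    }
]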
Technical Architecture
The model employs three key architectural innovations:
- Interleaved-MRoPE: Enhanced positional embeddings for multimodal sequences
- DeepStack: Multi-level feature fusion for richer visual representations
- Text-Timestamp Alignment: Precise temporal grounding for video understanding
Vision Encoder: Enhanced Vision Transformer with window attention, optimized using SwiGLU and RMSNorm
Video Processing: Dynamic frame rate sampling with updated mRoPE using temporal IDs and absolute time alignment
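These components are wired up automatically when the checkpoint is loaded. If you want to confirm what a given download actually exposes, you can inspect its configuration. A minimal sketch, assuming the files have already been placed in the local path used throughout this README; the sub-config attribute names are assumptions and may differ between Transformers versions:

from transformers import AutoConfig

# Inspect the composite configuration shipped with the checkpoint
config = AutoConfig.from_pretrained("E:/huggingface/qwen3-vl-4b-thinking")
print(type(config).__name__)                        # top-level config class
print(getattr(config, "vision_config", None))       # vision encoder settings, if exposed as a sub-config
print(getattr(config, "text_config", config))       # language-model settings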
Repository Contents
Note: This directory is prepared for model storage. Download model files from the official Hugging Face repository.
Expected Model Files
When downloaded, this repository will contain:
| File | Purpose | Approximate Size |
|---|---|---|
| model.safetensors (or sharded files) | Model weights (BF16) | ~8-9 GB |
| config.json | Model configuration | ~2 KB |
| tokenizer.json | Tokenizer vocabulary | ~7 MB |
| tokenizer_config.json | Tokenizer configuration | ~1 KB |
| generation_config.json | Generation parameters | ~1 KB |
| preprocessor_config.json | Vision preprocessor config | ~1 KB |
| merges.txt | BPE merge rules | ~2 MB |
| vocab.json | Vocabulary mapping | ~5 MB |
Total Repository Size: ~8-10 GB
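One way to populate this directory is huggingface_hub's snapshot_download. A minimal sketch; the local path is simply the example location used elsewhere in this README:

from huggingface_hub import snapshot_download

# Download all model files from the official repository into the local directory
snapshot_download(
    repo_id="Qwen/Qwen3-VL-4B-Thinking",
    local_dir="E:/huggingface/qwen3-vl-4b-thinking"
)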
Hardware Requirements
Minimum Requirements
- VRAM: 10 GB (with FP16/BF16 precision)
- RAM: 16 GB system memory
- Disk Space: 12 GB for model files
- GPU: NVIDIA GPU with Compute Capability 7.0+ (V100, RTX 20/30/40 series, A100, etc.)
Recommended Requirements
- VRAM: 16 GB or more for optimal performance
- RAM: 32 GB system memory
- GPU: NVIDIA A100, RTX 4090, or similar high-end GPUs
- Disk Space: 20 GB for models and cache
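A quick way to check whether a machine meets these requirements before loading the model, using PyTorch's CUDA introspection:

import torch

# Report GPU name, compute capability, and total VRAM
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: compute capability {props.major}.{props.minor}, {vram_gb:.1f} GB VRAM")
    if (props.major, props.minor) < (7, 0) or vram_gb < 10:
        print("Below the minimum requirements listed above; consider a quantized variant.")
else:
    print("No CUDA GPU detected; use the GGUF version for CPU inference.")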
Quantized Versions
For reduced VRAM usage, consider:
- FP8 Version: Qwen3-VL-4B-Thinking-FP8 (~4-5 GB VRAM)
- GGUF Version: NexaAI/Qwen3-VL-4B-Thinking-GGUF for CPU inference
Usage Examples
Installation
Install the latest Transformers library from source:
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils # Optional: for advanced visual input handling
pip install accelerate pillow torch torchvision
Basic Image Understanding
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
# Load model and processor
model_path = "E:/huggingface/qwen3-vl-4b-thinking"
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
# Load and process image
image = Image.open("your_image.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "Describe this image in detail with reasoning steps."}
]
}
]
# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)
# Generate with thinking parameters
output_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    top_p=0.95,
    top_k=20,
    temperature=1.0
)
# Decode output
generated_text = processor.batch_decode(
output_ids[:, inputs.input_ids.shape[1]:],
skip_special_tokens=True
)[0]
print(generated_text)
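The Thinking variant typically emits its chain of thought before the final answer; depending on the chat template, the reasoning may be wrapped in <think>...</think> tags. A hedged helper for separating the two (it assumes the closing tag appears literally in the decoded text; verify against the outputs you actually see):

# Split the decoded output into reasoning and final answer (assumes </think> delimits them)
def split_thinking(text: str):
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

reasoning, answer = split_thinking(generated_text)
print("Reasoning:", reasoning[:500])
print("Answer:", answer)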
Video Understanding
import cv2
# Uniformly sample up to 16 frames across the video
video_path = "your_video.mp4"
cap = cv2.VideoCapture(video_path)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
num_samples = 16
step = max(total_frames // num_samples, 1)
frames = []
for idx in range(0, total_frames, step):
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    if len(frames) >= num_samples:
        break
cap.release()
# Create video message
messages = [
{
"role": "user",
"content": [
{"type": "video", "video": frames},
{"type": "text", "text": "Analyze the events in this video and explain your reasoning."}
]
}
]
# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], videos=[frames], return_tensors="pt")
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True, top_p=0.95, top_k=20, temperature=1.0)
result = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(result)
OCR and Document Analysis
# Load document image
document_image = Image.open("invoice.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": document_image},
{"type": "text", "text": "Extract all text from this document and structure it as JSON. Explain your extraction process."}
]
}
]
# Process with thinking approach
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[document_image], return_tensors="pt")
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=4096)
extracted_text = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(extracted_text)
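Because the Thinking variant interleaves reasoning with the requested JSON, it can help to pull the first JSON object or array out of the decoded text. A minimal sketch, assuming the model returns the JSON verbatim (possibly surrounded by explanatory prose):

import json
import re

# Extract the first JSON object or array embedded in the model output
def extract_json(text: str):
    match = re.search(r"(\{.*\}|\[.*\])", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None
    return None

structured = extract_json(extracted_text)
print(structured)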
Model Specifications
| Specification | Details |
|---|---|
| Parameters | 4 billion |
| Precision | BF16 (bfloat16) |
| Context Length | 256K tokens (expandable to 1M) |
| Languages (OCR) | 32 languages |
| Max Output Tokens | 40,960 (vision-language), 32,768 (text-only) |
| Architecture | Vision Transformer + Qwen3 LLM with Interleaved-MRoPE |
| License | Apache 2.0 |
| Release Date | October 15, 2025 |
| Format | Safetensors |
Recommended Generation Parameters
Vision-Language Tasks:
generation_config = {
"top_p": 0.95,
"top_k": 20,
"temperature": 1.0,
"max_new_tokens": 4096,
"do_sample": True
}
Text-Only Tasks:
generation_config = {
"top_p": 0.95,
"top_k": 20,
"temperature": 1.0,
"presence_penalty": 1.5,
"max_new_tokens": 32768,
"do_sample": True
}
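Either dictionary can be unpacked directly into generate; a minimal sketch using the vision-language settings and the inputs prepared in the examples above. Note that presence_penalty in the text-only block is mainly honored by serving engines such as vLLM and may be ignored (with a warning) by transformers' generate.

# Pass the vision-language dictionary above directly as keyword arguments
output_ids = model.generate(**inputs, **generation_config)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]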
Performance Tips
- Use BF16/FP16: Maintains quality while reducing memory usage
- Batch Processing: Process multiple images/videos together when possible (see the batching sketch after this list)
- Dynamic Frame Sampling: For long videos, use dynamic sampling to reduce computational cost
- Gradient Checkpointing: Enable for fine-tuning with limited VRAM
- Flash Attention: Use Flash Attention 2 for faster inference (requires compatible hardware)
- Quantization: Consider FP8 or GGUF versions for deployment scenarios
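The batching tip above amounts to passing lists of prompts and images through the processor in a single call. A minimal sketch, assuming two hypothetical image files and the model/processor loaded earlier; the left-padding setting is a general recommendation for batched generation with decoder-only models:

# Batched inference: one processor call for several image-prompt pairs
images = [Image.open("image_a.jpg"), Image.open("image_b.jpg")]  # hypothetical files
prompts = ["Describe this image.", "What objects are visible here?"]

batched_messages = [
    [{"role": "user", "content": [{"type": "image", "image": img}, {"type": "text", "text": txt}]}]
    for img, txt in zip(images, prompts)
]
texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in batched_messages]

processor.tokenizer.padding_side = "left"  # pad prompts on the left so generation starts aligned
inputs = processor(text=texts, images=images, padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
results = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
for r in results:
    print(r)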
Memory Optimization
# Enable memory-efficient attention
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2" # Requires flash-attn package
)
# Or use 8-bit quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
quantization_config=quantization_config,
device_map="auto"
)
Variants and Fine-Tunes
Official Variants
- Qwen3-VL-4B-Instruct: Standard instruction-tuned version
- Qwen3-VL-4B-Thinking-FP8: Quantized FP8 version
- NexaAI/Qwen3-VL-4B-Thinking-GGUF: GGUF format for CPU inference
Community Versions
- 23+ quantized variants available on Hugging Face
- 6+ fine-tuned versions for specialized tasks
Limitations
- Hallucinations: May generate plausible but incorrect visual interpretations
- Computational Requirements: Requires significant GPU resources for optimal performance
- Video Length: While supporting 1+ hour videos, processing time increases significantly
- Language Support: OCR optimized for 32 languages; others may have reduced accuracy
- Chain-of-Thought Verbosity: Thinking edition produces longer outputs; use Instruct version for concise responses
License
This model is released under the Apache License 2.0.
Copyright 2025 Qwen Team, Alibaba Cloud
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Citation
If you use Qwen3-VL-4B-Thinking in your research, please cite:
@article{qwen3vl2025,
title={Qwen3-VL: Superior Text Understanding \& Generation, Deeper Visual Perception \& Reasoning},
author={Qwen Team},
journal={arXiv preprint},
year={2025},
url={https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking}
}
Resources
- Official Repository: https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking
- GitHub: https://github.com/QwenLM/Qwen3-VL
- Documentation: https://huggingface.co/docs/transformers/main/model_doc/qwen3_vl
- Model Series: https://huggingface.co/Qwen
- Technical Blog: https://qwenlm.github.io/blog/qwen3-vl/
Acknowledgments
Developed by the Qwen team at Alibaba Cloud. Special thanks to the Hugging Face Transformers team for integration support and the open-source community for feedback and contributions.