Qwen3-VL-4B-Thinking

Qwen3-VL-4B-Thinking is a 4-billion-parameter multimodal vision-language model developed by the Qwen team at Alibaba Cloud. This "Thinking" variant emphasizes deeper multi-step reasoning, analysis, and planning with detailed chain-of-thought generation, making it ideal for advanced visual reasoning tasks across images, text, and video inputs.

Model Description

Qwen3-VL-4B-Thinking represents the latest advancement in the Qwen vision-language model series, designed to deliver superior text understanding, deeper visual perception, and extended reasoning capabilities. The model excels at producing structured outputs that expose intermediate reasoning steps, making it particularly valuable for research applications, multimodal understanding tasks, and agentic workflows.

Key Capabilities

  • Advanced Visual Reasoning: Multi-step chain-of-thought generation for complex visual understanding
  • Extended OCR Support: 32 languages with robust performance in low light, blur, and tilt conditions; improved handling of rare/ancient characters and specialized jargon
  • Text-Vision Fusion: Text understanding capabilities on par with pure language models
  • Visual Agent Functions: Operates PC and mobile GUIs for interactive applications
  • Visual Coding: Generates Draw.io diagrams, HTML/CSS/JavaScript from images and videos
  • Spatial Perception: Advanced 2D and 3D grounding with object positioning
  • Video Understanding: Processes videos exceeding 1 hour with temporal event localization
  • Extended Context: Handles up to 256K tokens (expandable to 1M tokens)
  • Structured Output: Document analysis for invoices, forms, tables, and charts

Technical Architecture

The model employs three key architectural innovations:

  1. Interleaved-MRoPE: Enhanced positional embeddings for multimodal sequences
  2. DeepStack: Multi-level feature fusion for richer visual representations
  3. Text-Timestamp Alignment: Precise temporal grounding for video understanding

Vision Encoder: Enhanced Vision Transformer with window attention, optimized using SwiGLU and RMSNorm

Video Processing: Dynamic frame rate sampling with updated mRoPE using temporal IDs and absolute time alignment

Repository Contents

Note: This directory is prepared for model storage. Download model files from the official Hugging Face repository.
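
One way to populate this directory is with the huggingface_hub Python client. The snippet below is a minimal sketch that assumes the official repository ID Qwen/Qwen3-VL-4B-Thinking and a local target path of your choosing.

from huggingface_hub import snapshot_download

# Download all model files from the official repository into this directory
snapshot_download(
    repo_id="Qwen/Qwen3-VL-4B-Thinking",
    local_dir="E:/huggingface/qwen3-vl-4b-thinking",  # adjust to your storage path
)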

Expected Model Files

When downloaded, this repository will contain:

File | Purpose | Approximate Size
model.safetensors (or sharded files) | Model weights (BF16) | ~8-9 GB
config.json | Model configuration | ~2 KB
tokenizer.json | Tokenizer vocabulary | ~7 MB
tokenizer_config.json | Tokenizer configuration | ~1 KB
generation_config.json | Generation parameters | ~1 KB
preprocessor_config.json | Vision preprocessor config | ~1 KB
merges.txt | BPE merge rules | ~2 MB
vocab.json | Vocabulary mapping | ~5 MB

Total Repository Size: ~8-10 GB

Hardware Requirements

Minimum Requirements

  • VRAM: 10 GB (with FP16/BF16 precision)
  • RAM: 16 GB system memory
  • Disk Space: 12 GB for model files
  • GPU: NVIDIA GPU with Compute Capability 7.0+ (V100, RTX 20/30/40 series, A100, etc.)

Recommended Requirements

  • VRAM: 16 GB or more for optimal performance
  • RAM: 32 GB system memory
  • GPU: NVIDIA A100, RTX 4090, or similar high-end GPUs
  • Disk Space: 20 GB for models and cache
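
As a rough sanity check on these numbers, weight memory scales with parameter count times bytes per parameter. The short sketch below is back-of-the-envelope arithmetic, not a measurement; activations and the KV cache add several more gigabytes on top of the weights.

# Weight-only memory estimate for a 4-billion-parameter model
params = 4e9
for name, bytes_per_param in [("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gib = params * bytes_per_param / (1024 ** 3)
    print(f"{name}: ~{gib:.1f} GiB of weights")
# Prints roughly: BF16 ~7.5 GiB, INT8 ~3.7 GiB, INT4 ~1.9 GiB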

Quantized Versions

For reduced VRAM usage, consider the community-provided quantized builds (for example FP8 or GGUF variants) listed under Variants and Fine-Tunes below, or load the model with 8-bit or 4-bit quantization as shown in the Memory Optimization section.

Usage Examples

Installation

Install the latest Transformers library from source:

pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils  # Optional: for advanced visual input handling
pip install accelerate pillow torch torchvision

Basic Image Understanding

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model_path = "E:/huggingface/qwen3-vl-4b-thinking"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Load and process image
image = Image.open("your_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail with reasoning steps."}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

# Generate with thinking parameters
output_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    top_p=0.95,
    top_k=20,
    temperature=1.0
)

# Decode output
generated_text = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0]
print(generated_text)
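
Other Qwen3 "Thinking" releases wrap the model's chain-of-thought in <think>...</think> markers ahead of the final answer. If this checkpoint behaves the same way (verify against the official model card), a minimal way to separate reasoning from the answer is:

# Split reasoning from the final answer, assuming Qwen3-style <think>...</think> markers
if "</think>" in generated_text:
    reasoning, _, answer = generated_text.partition("</think>")
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    reasoning, answer = "", generated_text.strip()
print("Reasoning:", reasoning)
print("Answer:", answer)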

Video Understanding

import cv2

# Sample 16 evenly spaced frames from the video
video_path = "your_video.mp4"
cap = cv2.VideoCapture(video_path)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
num_frames = 16
frames = []
for idx in range(num_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx * total_frames / num_frames))
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
cap.release()

# Create video message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frames},
            {"type": "text", "text": "Analyze the events in this video and explain your reasoning."}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], videos=[frames], return_tensors="pt")
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096, top_p=0.95)
result = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(result)

OCR and Document Analysis

# Load document image
document_image = Image.open("invoice.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": document_image},
            {"type": "text", "text": "Extract all text from this document and structure it as JSON. Explain your extraction process."}
        ]
    }
]

# Process with thinking approach
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[document_image], return_tensors="pt")
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
extracted_text = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(extracted_text)
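
Because the Thinking variant interleaves reasoning with the structured result, the JSON usually has to be pulled out of the surrounding prose. The sketch below is one best-effort way to grab the first JSON object in the output and parse it.

import json
import re

# Extract the first {...} block from the model output and parse it (best-effort)
match = re.search(r"\{.*\}", extracted_text, re.DOTALL)
if match:
    try:
        invoice_data = json.loads(match.group(0))
        print(invoice_data)
    except json.JSONDecodeError:
        print("Model returned JSON-like text that did not parse cleanly.")
else:
    print("No JSON object found in the output.")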

Model Specifications

Specification | Details
Parameters | 4 billion
Precision | BF16 (bfloat16)
Context Length | 256K tokens (expandable to 1M)
Languages (OCR) | 32 languages
Max Output Tokens | 40,960 (vision-language), 32,768 (text-only)
Architecture | Vision Transformer + Qwen3 LLM with Interleaved-MRoPE
License | Apache 2.0
Release Date | October 15, 2025
Format | Safetensors

Recommended Generation Parameters

Vision-Language Tasks:

generation_config = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 1.0,
    "max_new_tokens": 4096,
    "do_sample": True
}

Text-Only Tasks:

generation_config = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 1.0,
    "presence_penalty": 1.5,
    "max_new_tokens": 32768,
    "do_sample": True
}
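
These dictionaries mirror the sampling settings used in the examples above. One way to apply them, shown as a minimal sketch using the vision-language settings and the inputs prepared earlier, is to unpack them straight into generate. Note that presence_penalty is a sampler option in OpenAI-compatible engines such as vLLM rather than a standard transformers generate argument; with transformers you would typically use repetition_penalty instead.

# Apply the vision-language sampling settings to a transformers generate call
vl_generation_config = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 1.0,
    "max_new_tokens": 4096,
    "do_sample": True,
}
output_ids = model.generate(**inputs, **vl_generation_config)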

Performance Tips

  1. Use BF16/FP16: Maintains quality while reducing memory usage
  2. Batch Processing: Process multiple images/videos together when possible
  3. Dynamic Frame Sampling: For long videos, use dynamic sampling to reduce computational cost (see the sketch after this list)
  4. Gradient Checkpointing: Enable for fine-tuning with limited VRAM
  5. Flash Attention: Use Flash Attention 2 for faster inference (requires compatible hardware)
  6. Quantization: Consider FP8 or GGUF versions for deployment scenarios
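
As a concrete illustration of tip 3, the helper below (a sketch, not part of the official tooling) caps the number of sampled frames for long videos by deriving a frame budget from the video's duration.

import cv2

def dynamic_frame_indices(video_path, target_fps=1.0, max_frames=64):
    """Pick evenly spaced frame indices, roughly target_fps apart, capped at max_frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    cap.release()
    duration = total / fps
    num = min(max_frames, max(1, int(duration * target_fps)))
    return [int(i * total / num) for i in range(num)]

# Example: sample at ~1 frame/second but never more than 64 frames total
indices = dynamic_frame_indices("your_video.mp4", target_fps=1.0, max_frames=64)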

Memory Optimization

# Enable memory-efficient attention
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Requires flash-attn package
)

# Or use 8-bit quantization
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True  # 8-bit weight quantization; compute still runs in higher precision
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)
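
If 8-bit is still too large, a 4-bit NF4 configuration (a sketch using standard bitsandbytes options; validate the quality impact for your workload) reduces weight memory further:

# 4-bit NF4 quantization as a lower-memory alternative
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=quantization_config_4bit,
    device_map="auto"
)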

Variants and Fine-Tunes

Official Variants

  • Qwen3-VL-4B-Instruct: non-thinking counterpart that returns concise answers without chain-of-thought
  • FP8 releases of the Thinking and Instruct checkpoints for lower-memory deployment

Community Versions

  • 23+ quantized variants available on Hugging Face
  • 6+ fine-tuned versions for specialized tasks

Limitations

  • Hallucinations: May generate plausible but incorrect visual interpretations
  • Computational Requirements: Requires significant GPU resources for optimal performance
  • Video Length: While supporting 1+ hour videos, processing time increases significantly
  • Language Support: OCR optimized for 32 languages; others may have reduced accuracy
  • Chain-of-Thought Verbosity: Thinking edition produces longer outputs; use Instruct version for concise responses

License

This model is released under the Apache License 2.0.

Copyright 2025 Qwen Team, Alibaba Cloud

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Citation

If you use Qwen3-VL-4B-Thinking in your research, please cite:

@article{qwen3vl2025,
  title={Qwen3-VL: Superior Text Understanding \& Generation, Deeper Visual Perception \& Reasoning},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2025},
  url={https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking}
}

Resources

  • Official model page: https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking
  • Hugging Face Transformers documentation: https://huggingface.co/docs/transformers

Acknowledgments

Developed by the Qwen team at Alibaba Cloud. Special thanks to the Hugging Face Transformers team for integration support and the open-source community for feedback and contributions.

