Qwen3-VL-4B-Thinking

Qwen3-VL-4B-Thinking is a 4-billion-parameter multimodal vision-language model developed by the Qwen team at Alibaba Cloud. This "Thinking" variant emphasizes deeper multi-step reasoning, analysis, and planning with detailed chain-of-thought generation, making it ideal for advanced visual reasoning tasks across images, text, and video inputs.

Model Description

Qwen3-VL-4B-Thinking represents the latest advancement in the Qwen vision-language model series, designed to deliver superior text understanding, deeper visual perception, and extended reasoning capabilities. The model excels at producing structured outputs that expose intermediate reasoning steps, making it particularly valuable for research applications, multimodal understanding tasks, and agentic workflows.

Key Capabilities

  • Advanced Visual Reasoning: Multi-step chain-of-thought generation for complex visual understanding
  • Extended OCR Support: 32 languages with robust performance in low light, blur, and tilt conditions; improved handling of rare/ancient characters and specialized jargon
  • Text-Vision Fusion: Text understanding capabilities on par with pure language models
  • Visual Agent Functions: Operates PC and mobile GUIs for interactive applications
  • Visual Coding: Generates Draw.io diagrams, HTML/CSS/JavaScript from images and videos
  • Spatial Perception: Advanced 2D and 3D grounding with object positioning
  • Video Understanding: Processes videos exceeding 1 hour with temporal event localization
  • Extended Context: Handles up to 256K tokens (expandable to 1M tokens)
  • Structured Output: Document analysis for invoices, forms, tables, and charts

Technical Architecture

The model employs three key architectural innovations:

  1. Interleaved-MRoPE: Enhanced positional embeddings for multimodal sequences
  2. DeepStack: Multi-level feature fusion for richer visual representations
  3. Text-Timestamp Alignment: Precise temporal grounding for video understanding

Vision Encoder: Enhanced Vision Transformer with window attention, optimized using SwiGLU and RMSNorm

Video Processing: Dynamic frame rate sampling with updated mRoPE using temporal IDs and absolute time alignment

Repository Contents

Note: This directory is prepared for model storage. Download model files from the official Hugging Face repository.
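
One way to populate this directory is with the huggingface_hub Python client. The snippet below is a minimal sketch that assumes the official repository ID Qwen/Qwen3-VL-4B-Thinking and a local target path of your choosing.

from huggingface_hub import snapshot_download

# Download all model files from the official repository into this directory
snapshot_download(
    repo_id="Qwen/Qwen3-VL-4B-Thinking",
    local_dir="E:/huggingface/qwen3-vl-4b-thinking",  # adjust to your storage path
)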

Expected Model Files

When downloaded, this repository will contain:

File | Purpose | Approximate Size
model.safetensors (or sharded files) | Model weights (BF16) | ~8-9 GB
config.json | Model configuration | ~2 KB
tokenizer.json | Tokenizer vocabulary | ~7 MB
tokenizer_config.json | Tokenizer configuration | ~1 KB
generation_config.json | Generation parameters | ~1 KB
preprocessor_config.json | Vision preprocessor config | ~1 KB
merges.txt | BPE merge rules | ~2 MB
vocab.json | Vocabulary mapping | ~5 MB

Total Repository Size: ~8-10 GB

Hardware Requirements

Minimum Requirements

  • VRAM: 10 GB (with FP16/BF16 precision)
  • RAM: 16 GB system memory
  • Disk Space: 12 GB for model files
  • GPU: NVIDIA GPU with Compute Capability 7.0+ (V100, RTX 20/30/40 series, A100, etc.)

Recommended Requirements

  • VRAM: 16 GB or more for optimal performance
  • RAM: 32 GB system memory
  • GPU: NVIDIA A100, RTX 4090, or similar high-end GPUs
  • Disk Space: 20 GB for models and cache
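
As a rough sanity check on these numbers, weight memory scales with parameter count times bytes per parameter. The short sketch below is back-of-the-envelope arithmetic, not a measurement; activations and the KV cache add several more gigabytes on top of the weights.

# Weight-only memory estimate for a 4-billion-parameter model
params = 4e9
for name, bytes_per_param in [("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gib = params * bytes_per_param / (1024 ** 3)
    print(f"{name}: ~{gib:.1f} GiB of weights")
# Prints roughly: BF16 ~7.5 GiB, INT8 ~3.7 GiB, INT4 ~1.9 GiB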

Quantized Versions

For reduced VRAM usage, consider the community-provided quantized builds (for example FP8 or GGUF variants) listed under Variants and Fine-Tunes below, or load the model with 8-bit or 4-bit quantization as shown in the Memory Optimization section.

Usage Examples

Installation

Install the latest Transformers library from source:

pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils  # Optional: for advanced visual input handling
pip install accelerate pillow torch torchvision

Basic Image Understanding

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model_path = "E:/huggingface/qwen3-vl-4b-thinking"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Load and process image
image = Image.open("your_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail with reasoning steps."}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

# Generate with thinking parameters
output_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    top_p=0.95,
    top_k=20,
    temperature=1.0
)

# Decode output
generated_text = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0]
print(generated_text)
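
Other Qwen3 "Thinking" releases wrap the model's chain-of-thought in <think>...</think> markers ahead of the final answer. If this checkpoint behaves the same way (verify against the official model card), a minimal way to separate reasoning from the answer is:

# Split reasoning from the final answer, assuming Qwen3-style <think>...</think> markers
if "</think>" in generated_text:
    reasoning, _, answer = generated_text.partition("</think>")
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    reasoning, answer = "", generated_text.strip()
print("Reasoning:", reasoning)
print("Answer:", answer)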

Video Understanding

import cv2

# Sample 16 evenly spaced frames from the video
video_path = "your_video.mp4"
cap = cv2.VideoCapture(video_path)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
num_frames = 16
frames = []
for idx in range(num_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx * total_frames / num_frames))
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
cap.release()

# Create video message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frames},
            {"type": "text", "text": "Analyze the events in this video and explain your reasoning."}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], videos=[frames], return_tensors="pt")
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096, top_p=0.95)
result = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(result)

OCR and Document Analysis

# Load document image
document_image = Image.open("invoice.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": document_image},
            {"type": "text", "text": "Extract all text from this document and structure it as JSON. Explain your extraction process."}
        ]
    }
]

# Process with thinking approach
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[document_image], return_tensors="pt")
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
extracted_text = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(extracted_text)
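
Because the Thinking variant interleaves reasoning with the structured result, the JSON usually has to be pulled out of the surrounding prose. The sketch below is one best-effort way to grab the first JSON object in the output and parse it.

import json
import re

# Extract the first {...} block from the model output and parse it (best-effort)
match = re.search(r"\{.*\}", extracted_text, re.DOTALL)
if match:
    try:
        invoice_data = json.loads(match.group(0))
        print(invoice_data)
    except json.JSONDecodeError:
        print("Model returned JSON-like text that did not parse cleanly.")
else:
    print("No JSON object found in the output.")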

Model Specifications

Specification | Details
Parameters | 4 billion
Precision | BF16 (bfloat16)
Context Length | 256K tokens (expandable to 1M)
Languages (OCR) | 32 languages
Max Output Tokens | 40,960 (vision-language), 32,768 (text-only)
Architecture | Vision Transformer + Qwen3 LLM with Interleaved-MRoPE
License | Apache 2.0
Release Date | October 15, 2025
Format | Safetensors

Recommended Generation Parameters

Vision-Language Tasks:

generation_config = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 1.0,
    "max_new_tokens": 4096,
    "do_sample": True
}

Text-Only Tasks:

generation_config = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 1.0,
    "presence_penalty": 1.5,
    "max_new_tokens": 32768,
    "do_sample": True
}
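
These dictionaries mirror the sampling settings used in the examples above. One way to apply them, shown as a minimal sketch using the vision-language settings and the inputs prepared earlier, is to unpack them straight into generate. Note that presence_penalty is a sampler option in OpenAI-compatible engines such as vLLM rather than a standard transformers generate argument; with transformers you would typically use repetition_penalty instead.

# Apply the vision-language sampling settings to a transformers generate call
vl_generation_config = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 1.0,
    "max_new_tokens": 4096,
    "do_sample": True,
}
output_ids = model.generate(**inputs, **vl_generation_config)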

Performance Tips

  1. Use BF16/FP16: Maintains quality while reducing memory usage
  2. Batch Processing: Process multiple images/videos together when possible
  3. Dynamic Frame Sampling: For long videos, use dynamic sampling to reduce computational cost (see the sketch after this list)
  4. Gradient Checkpointing: Enable for fine-tuning with limited VRAM
  5. Flash Attention: Use Flash Attention 2 for faster inference (requires compatible hardware)
  6. Quantization: Consider FP8 or GGUF versions for deployment scenarios
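
As a concrete illustration of tip 3, the helper below (a sketch, not part of the official tooling) caps the number of sampled frames for long videos by deriving a frame budget from the video's duration.

import cv2

def dynamic_frame_indices(video_path, target_fps=1.0, max_frames=64):
    """Pick evenly spaced frame indices, roughly target_fps apart, capped at max_frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    cap.release()
    duration = total / fps
    num = min(max_frames, max(1, int(duration * target_fps)))
    return [int(i * total / num) for i in range(num)]

# Example: sample at ~1 frame/second but never more than 64 frames total
indices = dynamic_frame_indices("your_video.mp4", target_fps=1.0, max_frames=64)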

Memory Optimization

# Enable memory-efficient attention
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Requires flash-attn package
)

# Or use 8-bit quantization
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True  # 8-bit weight quantization; compute still runs in higher precision
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)
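
If 8-bit is still too large, a 4-bit NF4 configuration (a sketch using standard bitsandbytes options; validate the quality impact for your workload) reduces weight memory further:

# 4-bit NF4 quantization as a lower-memory alternative
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=quantization_config_4bit,
    device_map="auto"
)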

Variants and Fine-Tunes

Official Variants

  • Qwen3-VL-4B-Instruct: non-thinking counterpart that returns concise answers without chain-of-thought
  • FP8 releases of the Thinking and Instruct checkpoints for lower-memory deployment

Community Versions

  • 23+ quantized variants available on Hugging Face
  • 6+ fine-tuned versions for specialized tasks

Limitations

  • Hallucinations: May generate plausible but incorrect visual interpretations
  • Computational Requirements: Requires significant GPU resources for optimal performance
  • Video Length: While supporting 1+ hour videos, processing time increases significantly
  • Language Support: OCR optimized for 32 languages; others may have reduced accuracy
  • Chain-of-Thought Verbosity: Thinking edition produces longer outputs; use Instruct version for concise responses

License

This model is released under the Apache License 2.0.

Copyright 2025 Qwen Team, Alibaba Cloud

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Citation

If you use Qwen3-VL-4B-Thinking in your research, please cite:

@article{qwen3vl2025,
  title={Qwen3-VL: Superior Text Understanding \& Generation, Deeper Visual Perception \& Reasoning},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2025},
  url={https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking}
}

Resources

  • Official model page: https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking
  • Hugging Face Transformers documentation: https://huggingface.co/docs/transformers

Acknowledgments

Developed by the Qwen team at Alibaba Cloud. Special thanks to the Hugging Face Transformers team for integration support and the open-source community for feedback and contributions.

