zeekay committed
Commit 1906c78 · verified · 1 Parent(s): 97ee200

Initialize zen-vl-4b-agent model card

Files changed (1):
  1. README.md +91 -0
README.md ADDED
---
license: apache-2.0
tags:
- vision-language
- multimodal
- function-calling
- visual-agents
- qwen3-vl
- zen
language:
- en
- multilingual
base_model:
- Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
---

# Zen VL 4B Agent

Zen VL 4B Agent is a vision-language model with function calling and tool-use capabilities.

## Model Details

- **Architecture**: Qwen3-VL
- **Parameters**: 4B
- **Context Window**: 256K tokens (expandable to 1M)
- **License**: Apache 2.0
- **Training**: Fine-tuned with Zen identity and function calling

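These details can be cross-checked against the checkpoint itself by loading only its configuration; the snippet below is a minimal sketch that assumes nothing beyond the repository id used elsewhere on this card.

```python
from transformers import AutoConfig

# Load just the configuration (no weights) to inspect architecture and context-length settings
config = AutoConfig.from_pretrained("zenlm/zen-vl-4b-agent")
print(config.model_type)  # architecture family reported by the checkpoint
print(config)             # full config, including rope / max-position settings
```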

## Capabilities

- 🎨 **Visual Understanding**: Image analysis, video comprehension, spatial reasoning
- 📝 **OCR**: Text extraction in 32 languages
- 🧠 **Multimodal Reasoning**: STEM, math, code generation
- 🛠️ **Function Calling**: Tool use with visual context
- 🤖 **Visual Agents**: GUI interaction, parameter extraction

## Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-4b-agent",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("zenlm/zen-vl-4b-agent")

# Pair the image with the question in a single user turn
image = Image.open("example.jpg")
prompt = "What's in this image?"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=256)
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```
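Because the agent variant is tuned for function calling, a tool schema can in principle be passed through the chat template. The following is an illustrative sketch, not an official recipe: it reuses `model`, `processor`, and `image` from the example above, assumes the checkpoint follows the Qwen-style tool-calling template (the `tools` argument of `apply_chat_template`), and uses a hypothetical `get_weather` function.

```python
# Hypothetical tool definition (illustrative only; not part of this model card)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a named city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # e.g. a photo that identifies a city
            {"type": "text", "text": "Check the weather for the city shown in this photo."},
        ],
    }
]

# Assumes the chat template accepts a `tools` argument, as Qwen-family templates do
text = processor.apply_chat_template(
    messages, tools=tools, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
generated = outputs[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
# If the model opts to call the tool, the output should contain a structured
# tool call naming get_weather and its extracted arguments.
```

In a full agent loop, the emitted tool call would be executed and its result appended to the conversation as a tool message before generating the final answer.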

## Links

- 🌐 **Website**: [zenlm.org](https://zenlm.org)
- 📚 **GitHub**: [zenlm/zen-vl](https://github.com/zenlm/zen-vl)
- 📄 **Paper**: Coming soon
- 🤗 **Model Family**: [zenlm](https://huggingface.co/zenlm)

## Citation

```bibtex
@misc{zenvl2025,
  title={Zen VL: Vision-Language Models with Integrated Function Calling},
  author={Hanzo AI Team},
  year={2025},
  publisher={Zen Language Models},
  url={https://github.com/zenlm/zen-vl}
}
```

## License

Apache 2.0

---

Created by [Hanzo AI](https://hanzo.ai) for the Zen model family.