Safetensors
qwen2_5_vl
Commit 4132567 by zengw (verified) · 1 parent: 1eaee35

Update README.md

Files changed (1)
  1. README.md +127 -3
README.md CHANGED
@@ -1,3 +1,127 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+ ### GUI-G2-3B
+
+ This repository contains the GUI-G2-3B model from the paper [GUI-G²: Gaussian Reward Modeling
+ for GUI Grounding](https://arxiv.org/abs/2507.15846). More inference details are provided in the quick start of the GitHub repository. GUI-G2-3B results on GUI grounding benchmarks will be added in a later update.
+
+ [![Huggingface Paper](https://img.shields.io/badge/Paper-2507.15846-ffcc00?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/papers/2507.15846)
+ [![Paper](https://img.shields.io/badge/Paper-TBA-A42C25?style=for-the-badge)](https://arxiv.org/abs/2507.15846)
+ [![alphaXiv](https://img.shields.io/badge/alphaXiv-2507.15846-1f8ceb?style=for-the-badge)](https://www.alphaxiv.org/abs/2507.15846)
+ [![Project](https://img.shields.io/badge/Project-Page-007ec6?style=for-the-badge)](https://zju-real.github.io/GUI-G2)
+ [![GitHub](https://img.shields.io/badge/Code-GUI--G2-000000?style=for-the-badge&logo=github)](https://github.com/zju-real/GUI-G2)
+
+ ### Model Description
+
+ The model is based on `Qwen2.5-VL-3B-Instruct` and is fine-tuned with our proposed Gaussian dense reward framework.
+
+ * 💡 **Gaussian Point & Coverage Rewards**: Encourage accurate, spatially-aligned clicks.
+ * 📏 **Adaptive Variance Mechanism**: Adjusts reward granularity based on element scale.
+ * 🌍 **Dense Learning Signals**: Smooth gradients outperform binary RL rewards in early-stage learning.
+ * 📊 **State-of-the-art Performance** on ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro datasets.
+
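The Gaussian point reward and adaptive variance bullets above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' training code: the function names, the `alpha` scale factor, and the choice of making the standard deviation proportional to the element's width and height are assumptions introduced here, and the paper gives the exact formulation. A binary hit-or-miss reward is included only for contrast.

```python
import math


def gaussian_point_reward(pred_xy, gt_box, alpha=0.5):
    """Dense reward in (0, 1] that peaks at the center of the target box.

    pred_xy: predicted click position (x, y).
    gt_box:  ground-truth element box (x1, y1, x2, y2).
    alpha:   hypothetical scale factor tying the standard deviation to element size.
    """
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Adaptive variance (assumed form): larger elements get a wider Gaussian.
    sigma_x = max(alpha * (x2 - x1), 1e-6)
    sigma_y = max(alpha * (y2 - y1), 1e-6)
    dx = (pred_xy[0] - cx) / sigma_x
    dy = (pred_xy[1] - cy) / sigma_y
    return math.exp(-0.5 * (dx * dx + dy * dy))


def binary_reward(pred_xy, gt_box):
    """Sparse baseline: 1.0 if the click lands inside the box, else 0.0."""
    x1, y1, x2, y2 = gt_box
    inside = x1 <= pred_xy[0] <= x2 and y1 <= pred_xy[1] <= y2
    return 1.0 if inside else 0.0
```

With these definitions, a click 10 px off-center scores lower on a 20 px wide button than on a 100 px wide panel, which is the scale-dependent granularity the bullet describes. A near miss just outside the box still receives a nonzero Gaussian reward, while the binary reward gives no signal.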
+ ### Quick Start
+
+ First, install the required dependencies:
+
+ ```bash
+ pip install transformers==4.49.0 qwen-vl-utils
+ ```
+
+ Then load the model and run inference on a screenshot:
+
+ ```python
+ import torch
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # We recommend enabling flash_attention_2 for better acceleration and memory saving,
+ # especially in multi-image and video scenarios. It requires the flash-attn package;
+ # remove the attn_implementation argument to use the default attention instead.
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "inclusionAI/GUI-G2-3B",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+     device_map="auto",
+ )
+
+ processor = AutoProcessor.from_pretrained("inclusionAI/GUI-G2-3B")
+ image_path = ''   # path or URL of the screenshot
+ instruction = ''  # natural-language description of the target element
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": image_path,  # pass the variable, not the string "image_path"
+             },
+             {"type": "text", "text": instruction},
+         ],
+     }
+ ]
+
+ # Preparation for inference
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(model.device)
+
+ # Inference: generate, then drop the prompt tokens from each sequence
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
+
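The quick start prints the raw generated text. A possible post-processing step is sketched below, assuming the text contains either a box `[x1, y1, x2, y2]` or a point `(x, y)` written as plain numbers. The output format is not stated in this README, and `parse_click_point` is a hypothetical helper, so check the GitHub quick start before relying on it.

```python
import re


def parse_click_point(output_text):
    """Return an (x, y) click point parsed from generated text, or None.

    Assumed formats: four numbers are read as a box (x1, y1, x2, y2) and
    reduced to its center; two numbers are read as a point (x, y).
    """
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", output_text)]
    if len(numbers) >= 4:
        x1, y1, x2, y2 = numbers[:4]
        return ((x1 + x2) / 2, (y1 + y2) / 2)
    if len(numbers) >= 2:
        return (numbers[0], numbers[1])
    return None
```

It could be applied as `parse_click_point(output_text[0])`. Whether the coordinates refer to the original screenshot or to the image as resized by the processor is also not stated here, so rescaling may be needed.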
+ ### 📊 Results on ScreenSpot-v2
+
+ | **Model** | **Mobile Text** | **Mobile Icon** | **Desktop Text** | **Desktop Icon** | **Web Text** | **Web Icon** | **Avg.** |
+ | -------------------- | --------------- | --------------- | ---------------- | ---------------- | ------------ | ------------ | -------- |
+ | GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 |
+ | Qwen2.5-VL-3B | 93.4 | 73.5 | 88.1 | 58.6 | 88.0 | 71.4 | 80.9 |
+ | Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 |
+ | SeeClick-9.6B | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | 55.1 |
+ | UGround-7B | 75.1 | 84.5 | 85.1 | 61.4 | 84.6 | 71.9 | 76.3 |
+ | OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 |
+ | UI-TARS-2B | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | 84.7 |
+ | UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
+ | UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
+ | JEDI-7B | 96.9 | 87.2 | 95.9 | 87.9 | 94.4 | 84.2 | 91.7 |
+ | GUI-Actor-7B | 97.6 | 88.2 | 96.9 | 85.7 | 93.2 | 86.7 | 92.1 |
+ | UI-R1-3B | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 |
+ | UI-R1-E-3B | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 |
+ | SE-GUI-7B | - | - | - | - | - | - | 90.3 |
+ | LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 |
+ | **GUI-G²-7B (Ours)** | **98.3** | **91.9** | **95.4** | **89.3** | **94.0** | **87.7** | **93.3** |
+
+ ---
+
+ ### 🙏 Acknowledgement
+
+ The RL training code is built on the [VLM-R1 project](https://github.com/om-ai-lab/VLM-R1).
+
+ ### 📄 Citation
+
+ If you use GUI-G², please cite our work:
+
+ ```bibtex
+ @misc{tang2025guig2gaussianrewardmodeling,
+   title={GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding},
+   author={Fei Tang and Zhangxuan Gu and Zhengxi Lu and Xuyang Liu and Shuheng Shen and Changhua Meng and Wen Wang and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
+   year={2025},
+   eprint={2507.15846},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG},
+   url={https://arxiv.org/abs/2507.15846},
+ }
+ ```