inclusionAI/UI-Venus-Ground-72B

@@ -5,7 +5,9 @@ library_name: transformers
 ---
 ### UI-Venus
-This repository contains the UI-Venus model from the report [UI-Venus: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833). UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. More inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks including Screenspot-v2/Pro and AndroidWorld.
@@ -194,57 +196,6 @@ Scores are in percentage (%). `T` = Text, `I` = Icon.
 > 🔝 **Experimental results show that UI-Venus-Ground-72B achieves state-of-the-art performance on ScreenSpot-Pro with an average score of 61.7, while also setting new benchmarks on ScreenSpot-v2 (95.3), OSWorld_G (69.8), AgentCPM (84.7), and UI-Vision (38.0), highlighting its effectiveness in complex visual grounding and action prediction tasks.**
-### Results on AndroidWorld
-This is the compressed package of validation trajectories for **AndroidWorld**, including execution logs and navigation paths.
-📥 Download: [UI-Venus-androidworld.zip](vis_androidworld/UI-Venus-androidworld.zip)
-| Models | With Planner | A11y Tree | Screenshot | Success Rate (pass@1) |
-|--------|--------------|-----------|------------|------------------------|
-| **Closed-source Models** | | | | |
-| GPT-4o| ❌ | ✅ | ❌ | 30.6 |
-| ScaleTrack| ❌ | ✅ | ❌ | 44.0 |
-| SeedVL-1.5 | ❌ | ✅ | ✅ | 62.1 |
-| UI-TARS-1.5 | ❌ | ❌ | ✅ | 64.2 |
-| **Open-source Models** | | | | |
-| GUI-Critic-R1-7B | ❌ | ✅ | ✅ | 27.6 |
-| Qwen2.5-VL-72B* | ❌ | ❌ | ✅ | 35.0 |
-| UGround | ✅ | ❌ | ✅ | 44.0 |
-| Aria-UI | ✅ | ❌ | ✅ | 44.8 |
-| UI-TARS-72B | ❌ | ❌ | ✅ | 46.6 |
-| GLM-4.5v | ❌ | ❌ | ✅ | 57.0 |
-| **Ours** | | | | |
-| UI-Venus-Navi-7B | ❌ | ❌ | ✅ | **49.1** |
-| UI-Venus-Navi-72B | ❌ | ❌ | ✅ | **65.9** |
-> **Table:** Performance comparison on **AndroidWorld** for end-to-end models. Our UI-Venus-Navi-72B achieves state-of-the-art performance, outperforming all baseline methods across different settings.
-### Results on AndroidControl and GUI-Odyssey
-| Models | AndroidControl-Low<br>Type Acc. | AndroidControl-Low<br>Step SR | AndroidControl-High<br>Type Acc. | AndroidControl-High<br>Step SR | GUI-Odyssey<br>Type Acc. | GUI-Odyssey<br>Step SR |
-|--------|-------------------------------|-----------------------------|-------------------------------|-----------------------------|------------------------|----------------------|
-| **Closed-source Models** | | | | | | |
-| GPT-4o | 74.3 | 19.4 | 66.3 | 20.8 | 34.3 | 3.3 |
-| **Open-source Models** | | | | | | |
-| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 59.5 | 46.3 |
-| SeeClick | 93.0 | 75.0 | 82.9 | 59.1 | 71.0 | 53.9 |
-| OS-Atlas-7B | 93.6 | 85.2 | 85.2 | 71.2 | 84.5 | 62.0 |
-| Aguvis-7B| - | 80.5 | - | 61.5 | - | - |
-| Aguvis-72B| - | 84.4 | - | 66.4 | - | - |
-| OS-Genesis-7B | 90.7 | 74.2 | 66.2 | 44.5 | - | - |
-| UI-TARS-7B| 98.0 | 90.8 | 83.7 | 72.5 | 94.6 | 87.0 |
-| UI-TARS-72B| **98.1** | 91.3 | 85.2 | 74.7 | **95.4** | **88.6** |
-| GUI-R1-7B| 85.2 | 66.5 | 71.6 | 51.7 | 65.5 | 38.8 |
-| NaviMaster-7B | 85.6 | 69.9 | 72.9 | 54.0 | - | - |
-| UI-AGILE-7B | 87.7 | 77.6 | 80.1 | 60.6 | - | - |
-| AgentCPM-GUI | 94.4 | 90.2 | 77.7 | 69.2 | 90.0 | 75.0 |
-| **Ours** | | | | | | |
-| UI-Venus-Navi-7B | 97.1 | 92.4 | **86.5** | 76.1 | 87.3 | 71.5 |
-| UI-Venus-Navi-72B | 96.7 | **92.9** | 85.9 | **77.2** | 87.2 | 72.4 |
-> **Table:** Performance comparison on offline UI navigation datasets including AndroidControl and GUI-Odyssey. Note that models with * are reproduced.
 # Citation
 Please consider citing if you find our work useful:
 ```plain

 ---
 ### UI-Venus
+This repository contains the UI-Venus model from the report [UI-Venus: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833).
+UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. More inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks including ScreenSpot-v2/Pro and AndroidWorld.
 > 🔝 **Experimental results show that UI-Venus-Ground-72B achieves state-of-the-art performance on ScreenSpot-Pro with an average score of 61.7, while also setting new benchmarks on ScreenSpot-v2 (95.3), OSWorld_G (69.8), AgentCPM (84.7), and UI-Vision (38.0), highlighting its effectiveness in complex visual grounding and action prediction tasks.**
 # Citation
 Please consider citing if you find our work useful:
 ```plain