
We are releasing [**🍨 Gelato-30B-A3B**](https://huggingface.co/mlfoundations/Gelato-30B-A3B), a state-of-the-art grounding model for GUI computer-use tasks! Gelato is trained on our open-sourced [**Click-100k**](https://huggingface.co/datasets/mlfoundations/clicks-100k) dataset and achieves **63.88% accuracy on ScreenSpot-Pro**<sup>[[3](#ref-screenspot-pro)]</sup> and **67.19% / 73.40% on OS-World-G / OS-World-G (Refined)**<sup>[[4](#ref-jedi)]</sup>, surpassing prior specialized computer-grounding models such as GTA1-32B<sup>[[5](#ref-gta1)]</sup> and much larger VLMs, including Qwen3-VL-235B-A22B-Instruct<sup>[[10](#ref-qwen3vl)]</sup>. When combined with GPT-5, Gelato enables frontier-level agentic performance, placing *TBD* on the [OS-World leaderboard](https://github.com/mlfoundations/grounding-model-os-world) at *TBD* accuracy.

For details on data curation and training, refer to our [blog post](https://huggingface.co/mlfoundations-cua-dev/Gelato-30B-A3B).
# Performance

Gelato-30B-A3B outperforms the SoTA specialized computer grounding model, GTA1-32B.
| **Model** | **Total Size** | **Activated Size** | **Open Source** | **ScreenSpot-V2** | **ScreenSpotPro** | **OSWORLD-G** |
|------------|:--------------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:--------------:|
| Qwen3-VL-30B-A3B-Instruct | 30 B | 3.3 B | ✅ | – | – | – |
| Qwen3-VL-235B-A22B-Instruct | 235 B | 22 B | ✅ | – | 62.0 | 66.7 |
| OpenCUA-72B | 72 B | – | ✅ | – | 60.8 | 59.2 |
| GTA1-32B | 32 B | – | ✅ | – | – | – |
| Gelato-30B-A3B | 30 B | 3.3 B | ✅ | – | 63.88 | 73.40 |

> **Note:**
> - Model size is indicated in billions (B) of parameters.
> - A dash (–) denotes results that are currently unavailable.
> - Qwen2.5-VL-7B-Instruct and Qwen3-VL-30B-A3B-Instruct are applied as our baseline models.
> - Δ indicates the performance improvement (↑) of our model compared to its baseline.
# Inference
Below is a code snippet demonstrating how to perform grounding with our model. Given an image and an instruction, the model outputs normalized coordinates in the range [0, 1000].
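Before the full snippet, here is a minimal sketch of the post-processing step: mapping the model's normalized [0, 1000] coordinates back onto the screenshot's pixel grid. The `(x, y)` reply format and the helper names (`parse_click`, `to_pixels`) are illustrative assumptions, not the model card's official API.

```python
import re

def parse_click(text: str) -> tuple[int, int]:
    """Extract the first '(x, y)' integer pair from the model's reply.

    The '(x, y)' output format is an assumption for illustration.
    """
    m = re.search(r"\((\d+),\s*(\d+)\)", text)
    if m is None:
        raise ValueError(f"no coordinate pair found in: {text!r}")
    return int(m.group(1)), int(m.group(2))

def to_pixels(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Map coordinates normalized to [0, 1000] onto a width x height screenshot."""
    return round(x / 1000 * width), round(y / 1000 * height)

# Example: a reply of "(500, 250)" on a 1920x1080 screenshot
x, y = parse_click("(500, 250)")
print(to_pixels(x, y, 1920, 1080))  # -> (960, 270)
```

Because the coordinates are normalized, the same model output can be replayed on any display resolution by scaling against the actual screenshot size.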