
We are releasing [**🍨 Gelato-30B-A3B**](https://huggingface.co/mlfoundations/Gelato-30B-A3B), a state-of-the-art grounding model for GUI computer-use tasks! Gelato is trained on our open-sourced [**Click-100k**](https://huggingface.co/datasets/mlfoundations/clicks-100k) dataset and achieves **63.88% accuracy on ScreenSpot-Pro**<sup>[[3](#ref-screenspot-pro)]</sup> and **67.19% / 73.40% on OS-World-G / OS-World-G (Refined)**<sup>[[4](#ref-jedi)]</sup>, surpassing prior specialized computer-grounding models such as GTA1-32B<sup>[[5](#ref-gta1)]</sup> and much larger VLMs, including Qwen3-VL-235B-A22B-Instruct<sup>[[10](#ref-qwen3vl)]</sup>. When combined with GPT-5, Gelato enables frontier-level agentic performance, placing *TBD* on the [OS-World leaderboard](https://github.com/mlfoundations/grounding-model-os-world) at *TBD* accuracy.

For details on data curation and training, refer to our [blog post](https://huggingface.co/mlfoundations-cua-dev/Gelato-30B-A3B).
# Performance

Gelato-30B-A3B outperforms the SoTA specialized computer grounding model, GTA1-32B.
| **Model** | **Total Size** | **Activated Size** | **Open Source** | **ScreenSpot-V2** | **ScreenSpotPro** | **OSWORLD-G** |
|------------|:--------------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:--------------:|
| Qwen3-VL-30B-A3B-Instruct | 30 B | 3.3 B | ✅ | – | – | – |
| Qwen3-VL-235B-A22B-Instruct | 235 B | 22 B | ✅ | – | 62.0 | 66.7 |
| OpenCUA-72B | 72 B | – | ✅ | – | 60.8 | 59.2 |
| GTA1-32B | 32 B | – | ✅ | – | – | – |
| Gelato-30B-A3B | 30 B | 3.3 B | ✅ | – | 63.88 | 73.40 |

> **Note:**
> - Model size is indicated in billions (B) of parameters.
> - A dash (–) denotes results that are currently unavailable.
> - Qwen2.5-VL-7B-Instruct and Qwen3-VL-30B-A3B-Instruct are applied as our baseline models.
> - Δ indicates the performance improvement (↑) of our model compared to its baseline.
# Inference
Below is a code snippet demonstrating how to perform grounding with our model. Given an image and an instruction, the model outputs normalized coordinates in the range [0, 1000].
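Before the full snippet, here is a minimal sketch of the post-processing step: mapping the model's normalized [0, 1000] coordinates back onto the screenshot's pixel grid. The `(x, y)` reply format and the helper names (`parse_click`, `to_pixels`) are illustrative assumptions, not the model card's official API.

```python
import re

def parse_click(text: str) -> tuple[int, int]:
    """Extract the first '(x, y)' integer pair from the model's reply.

    The '(x, y)' output format is an assumption for illustration.
    """
    m = re.search(r"\((\d+),\s*(\d+)\)", text)
    if m is None:
        raise ValueError(f"no coordinate pair found in: {text!r}")
    return int(m.group(1)), int(m.group(2))

def to_pixels(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Map coordinates normalized to [0, 1000] onto a width x height screenshot."""
    return round(x / 1000 * width), round(y / 1000 * height)

# Example: a reply of "(500, 250)" on a 1920x1080 screenshot
x, y = parse_click("(500, 250)")
print(to_pixels(x, y, 1920, 1080))  # -> (960, 270)
```

Because the coordinates are normalized, the same model output can be replayed on any display resolution by scaling against the actual screenshot size.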