aylinakkus committed
Commit ea315df · verified · 1 Parent(s): 0f0ae33

Upload README.md with huggingface_hub

Files changed (1): README.md +6 -11
README.md CHANGED
@@ -14,7 +14,9 @@ base_model: Qwen/Qwen3-VL-30B-A3B-Instruct

![Figure 1: Gelato-30B-A3B](assets/gelato-fig1.png)

- We are releasing [**🍨 Gelato-30B-A3B**](https://huggingface.co/mlfoundations/Gelato-30B-A3B), a state-of-the-art grounding model for GUI computer-use tasks! Gelato is trained on our open-sourced [**Click-100k**](https://huggingface.co/datasets/mlfoundations/clicks-100k) dataset and achieves **63.88% accuracy on ScreenSpot-Pro**<sup>[[3](#ref-screenspot-pro)]</sup> and **67.19% / 73.40% on OS-World-G / OS-World-G (Refined)**<sup>[[4](#ref-jedi)]</sup>, surpassing prior specialized computer grounding models like GTA1-32B<sup>[[5](#ref-gta1)]</sup> and much larger VLMs, including Qwen3-VL-235B-A22B-Instruct<sup>[[10](#ref-qwen3vl)]</sup>. When combined with GPT-5, Gelato enables frontier-level agentic performance, placing *TBD* on the [OS-World leaderboard](https://github.com/mlfoundations/grounding-model-os-world) at *TBD* accuracy.
+ We are releasing [**🍨 Gelato-30B-A3B**](https://huggingface.co/mlfoundations/Gelato-30B-A3B), a state-of-the-art grounding model for GUI computer-use tasks! Gelato is trained on our open-sourced [**Click-100k**](https://huggingface.co/datasets/mlfoundations/clicks-100k) dataset and achieves **63.88% accuracy on ScreenSpot-Pro**<sup>[[3](#ref-screenspot-pro)]</sup> and **67.19% / 73.40% on OS-World-G / OS-World-G (Refined)**<sup>[[4](#ref-jedi)]</sup>, surpassing prior specialized computer grounding models like GTA1-32B<sup>[[5](#ref-gta1)]</sup> and much larger VLMs, including Qwen3-VL-235B-A22B-Instruct<sup>[[10](#ref-qwen3vl)]</sup>. When combined with GPT-5, Gelato enables frontier-level agentic performance, placing *TBD* on the [OS-World leaderboard](https://github.com/mlfoundations/grounding-model-os-world) at *TBD* accuracy.
+
+ For details on data curation and training, refer to our [blog post](https://huggingface.co/mlfoundations-cua-dev/Gelato-30B-A3B).

# Performance

@@ -23,17 +25,10 @@ Gelato-30B-A3B outperforms the SoTA specialized computer grounding model, GTA1-32B
| **Model** | **Total Size** | **Activated Size** | **Open Source** | **ScreenSpot-V2** | **ScreenSpotPro** | **OSWORLD-G** |
|------------|:--------------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:--------------:|
| Qwen3-VL-30B-A3B-Instruct | 30 B | 3.3 B | ✅ | – | – | – |
- | Qwen3-VL-235B-A22B-Instruct | 235 B | 22 B | ✅ | – | – | – |
- | OpenCUA-72B | 72 B | – | ✅ | – | – | – |
+ | Qwen3-VL-235B-A22B-Instruct | 235 B | 22 B | ✅ | – | 62.0 | 66.7 |
+ | OpenCUA-72B | 72 B | – | ✅ | – | 60.8 | 59.2 |
| GTA1-32B | 32 B | – | ✅ | – | – | – |
- | Gelato-30B-A3B | 30 B | 3.3 B | ✅ | – | – | – |
-
-
- > **Note:**
- > - Model size is indicated in billions (B) of parameters.
- > - A dash (–) denotes results that are currently unavailable.
- > - Qwen2.5-VL-7B-Instruct and Qwen3-VL-30B-A3B-Instruct are applied as our baseline models.
- > - ∆ indicates the performance improvement of our model compared to its baseline.
+ | Gelato-30B-A3B | 30 B | 3.3 B | ✅ | – | 63.88 | 73.40 |

# Inference
Below is a code snippet demonstrating how to ground using our model. Given an image and an instruction, we output normalized coordinates in the range [0,1000].
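
The README's actual snippet falls outside the hunks shown above. As a rough illustration of the interface it describes (load the model, pass a screenshot plus an instruction, rescale the [0,1000]-normalized output to pixels), here is a minimal sketch. The loading path via AutoModelForImageTextToText/AutoProcessor, the chat-message format, and the "(x, y)" output pattern are assumptions, not confirmed by this diff; only the model ID and the [0,1000] convention come from the README.

```python
import re

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "mlfoundations/Gelato-30B-A3B"

# Load processor and model; device_map="auto" shards the MoE across available GPUs.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("screenshot.png")          # hypothetical screenshot
instruction = "Click the Save button"         # hypothetical grounding instruction

# Standard chat-style VLM input: one user turn carrying the image and the instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": instruction},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
text = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]

# The README states coordinates are normalized to [0, 1000]. Assuming the model
# emits them as "(x, y)", rescale to pixel coordinates for the actual click.
match = re.search(r"\((\d+),\s*(\d+)\)", text)
if match:
    x_norm, y_norm = int(match.group(1)), int(match.group(2))
    w, h = image.size
    x_px, y_px = round(x_norm / 1000 * w), round(y_norm / 1000 * h)
    print(f"click at ({x_px}, {y_px}) in a {w}x{h} screenshot")
```

Because the outputs are normalized to a fixed 1000-point grid rather than raw pixels, the same prediction transfers across screenshot resolutions; only the final rescaling step above depends on the image size.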