aylinakkus committed
Commit 0f0ae33 · verified · 1 Parent(s): 04619df

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +59 -74
README.md CHANGED
@@ -8,34 +8,25 @@ datasets:
    - mlfoundations-cua-dev/easyr1-103k-4MP-not-all-correct-stage-one-temp-1_1-RL-remove-pixmo-uground-seeclick # List datasets used for training
  base_model: Qwen/Qwen3-VL-30B-A3B-Instruct
  ---
- # Introduction
 
- We present OLGA – the **O**nline Reinforcement **L**earning **G**rounding **A**gent, based on Qwen3-VL-30B-A3B-Instruct, a mixture-of-experts model with 3.3B activated parameters.
- OLGA is trained using a novel **data recipe** that combines existing datasets, novel data collection, automated filtering, and online reinforcement learning.<br>
- Our final training corpus consists of 100k high-quality samples, blending existing and newly collected grounding data.
- The two-stage training pipeline consists of a supervised fine-tuning (SFT) step and a subsequent online reinforcement learning (DAPO) step to deliver state-of-the-art grounding performance among open-source models.
- For ablation studies and additional insights, see our detailed [blog post]()!
 
  # Performance
 
- We evaluate on benchmarks ScreenSpot-V2, ScreenSpotPro and OS-World-G for grounding, as well as the agentic benchmark OS-World. For the latter we use an [evaluation harness](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/gta1/gta1_agent.py) combining our grounding model with a planner (GPT-5):
-
- | **Model** | **Size** | **Open Source** | **ScreenSpot-V2** | **ScreenSpotPro** | **OSWORLD-G** |
- |-------------------|:--------:|:---------------:|:-----------------:|:-----------------:|:-----------------:|
- | OpenAI CUA | – | ❌ | 87.9 | 23.4 | – |
- | Claude 3.7 | – | ❌ | 87.6 | 27.7 | – |
- | JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 |
- | SE-GUI | 7B | ✅ | 90.3 | 47.0 | – |
- | UI-TARS-1.5 | 7B | ✅ | 89.7 | 42.0 | 64.2 |
- | UGround-v1-7B | 7B | ✅ | – | 31.1 | 36.4 |
- | Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9 | 48.0 | 59.6 |
- | UGround-v1-72B | 72B | ✅ | – | 34.5 | – |
- | Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.0 | 53.3 | 62.2 |
- | UI-TARS | 72B | ✅ | 90.3 | 38.1 | – |
- | GTA1 | 7B | ✅ | 92.4 | 50.1 | 67.7 |
- | GTA1 | 32B | ✅ | 93.2 | 53.6 | 61.9 |
- | GTA1 | 72B | ✅ | 94.8 | 58.4 | 66.7 |
- | OLGA-30B-MoE (Ours) | 30B | ✅ | – | 63.9 | 73 |
 
 
  > **Note:**
@@ -45,16 +36,12 @@ We evaluate on benchmarks ScreenSpot-V2, ScreenSpotPro and OS-World-G for ground
  > - ∆ indicates the performance improvement (∆) of our model compared to its baseline.
 
  # Inference
- Below is a code snippet demonstrating how to ground using our model. Given an image and an instruction, we output the absolute coordinates in the format (x,y).
-
- ## Coordinates
- Pay attention to the fact that Qwen's AutoProcessor rescales the image to multiples of 28 pixels. Therefore the absolute coordinates potentially need to be scaled back.
 
  ```python
  from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
  import re
- from qwen_vl_utils import process_vision_info, smart_resize
- from PIL import Image
  import requests
  from io import BytesIO
 
@@ -74,34 +61,32 @@ def extract_coordinates(raw_string):
      except:
          return 0,0
 
- def resize_image(image_path: str):
      """
-     Resize the image to a multiple of the patch size.
-     This is necessary because the model would resize the image and predict the coordinates in the resized image.
      Args:
-         image_path: str
-     Returns:
-         resized_image: PIL.Image
-         scale_x: float
-         scale_y: float
      """
-     response = requests.get(image_path)
-     img = Image.open(BytesIO(response.content))
-     width, height = img.width, img.height
-
-     resized_height, resized_width = smart_resize(
-         img.height,
-         img.width,
-         factor=processor.image_processor.patch_size * processor.image_processor.merge_size,
-         min_pixels=processor.image_processor.min_pixels,
-         max_pixels=processor.image_processor.max_pixels,
-     )
-     resized_image = img.resize((resized_width, resized_height))
-     scale_x, scale_y = width / resized_width, height / resized_height
-     return resized_image, scale_x, scale_y
 
  # Load the model and processor
- MODEL_PATH = "mlfoundations-cua-dev/Gelato-30B"
 
  model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
      MODEL_PATH,
@@ -110,49 +95,48 @@ model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
  )
 
  processor = AutoProcessor.from_pretrained(
-     MODEL_PATH
  )
 
  # Prepare messages
- SYSTEM_PROMPT = '''
  You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
 
  Output the coordinate pair exactly:
  (x,y)
  '''
- SYSTEM_PROMPT = SYSTEM_PROMPT.strip()
-
- resized_image, scale_x, scale_y = resize_image("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg")
 
  messages = [
-     {
-         "role": "system",
-         "content": [
-             {
-                 "type": "text",
-                 "text": SYSTEM_PROMPT
-             }
-         ]
-     },
      {
          "role": "user",
          "content": [
              {
                  "type": "image",
-                 "image": resized_image,
              },
-             {"type": "text", "text": "Describe this image."},
          ],
      }
  ]
 
  inputs = processor.apply_chat_template(
      messages,
      tokenize=True,
      add_generation_prompt=True,
      return_dict=True,
      return_tensors="pt"
- )
 
  # Inference: Generation of the output
  generated_ids = model.generate(**inputs, max_new_tokens=128)
@@ -164,8 +148,9 @@ output_text = processor.batch_decode(
  )
 
  # Extract the coordinates from the output text
  pred_x, pred_y = extract_coordinates(output_text[0])
- pred_x = pred_x * scale_x
- pred_y = pred_y * scale_y
- print(f"Predicted coordinates: ({pred_x}, {pred_y})")
- ```
  - mlfoundations-cua-dev/easyr1-103k-4MP-not-all-correct-stage-one-temp-1_1-RL-remove-pixmo-uground-seeclick # List datasets used for training
  base_model: Qwen/Qwen3-VL-30B-A3B-Instruct
  ---
+ # 🍨 Gelato – From Data Curation to Reinforcement Learning: Building a Strong Grounding Model for Computer-Use Agents
 
+ [🍨 **Gelato-30B-A3B (model)**](https://huggingface.co/mlfoundations/Gelato-30B-A3B) | [🖱️ **Click-100k (dataset)**](https://huggingface.co/datasets/mlfoundations/clicks-100k) | [🔗 **Training Instructions**](./training_configs) | [📈 **Evaluation**](./evaluation)
+
+ ![Figure 1: Gelato-30B-A3B](assets/gelato-fig1.png)
+
+ We are releasing [**🍨 Gelato-30B-A3B**](https://huggingface.co/mlfoundations/Gelato-30B-A3B), a state-of-the-art grounding model for GUI computer-use tasks! Gelato is trained on our open-sourced [**Click-100k**](https://huggingface.co/datasets/mlfoundations/clicks-100k) dataset and achieves **63.88% accuracy on ScreenSpot-Pro**<sup>[[3](#ref-screenspot-pro)]</sup> and **67.19% / 73.40% on OS-World-G / OS-World-G (Refined)**<sup>[[4](#ref-jedi)]</sup>, surpassing prior specialized computer grounding models such as GTA1-32B<sup>[[5](#ref-gta1)]</sup> and much larger VLMs including Qwen3-VL-235B-A22B-Instruct<sup>[[10](#ref-qwen3vl)]</sup>. When combined with GPT-5, Gelato enables frontier-level agentic performance, placing *TBD* on the [OS-World leaderboard](https://github.com/mlfoundations/grounding-model-os-world) at *TBD* accuracy.
 
  # Performance
 
+ Gelato-30B-A3B outperforms the SoTA specialized computer grounding model, GTA1-32B, and larger VLMs on the ScreenSpot-Pro and OS-World-G grounding benchmarks. When paired with GPT-5, Gelato as a computer-use agent attains a *TBD* success rate on OS-World, placing it *TBD* on the leaderboard.
+
+ | **Model** | **Total Size** | **Activated Size** | **Open Source** | **ScreenSpot-V2** | **ScreenSpotPro** | **OSWORLD-G** |
+ |------------|:--------------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:--------------:|
+ | Qwen3-VL-30B-A3B-Instruct | 30 B | 3.3 B | ✅ | – | – | – |
+ | Qwen3-VL-235B-A22B-Instruct | 235 B | 22 B | ✅ | – | – | – |
+ | OpenCUA-72B | 72 B | – | ✅ | – | – | – |
+ | GTA1-32B | 32 B | – | ✅ | – | – | – |
+ | Gelato-30B-A3B | 30 B | 3.3 B | ✅ | – | – | – |
 
 
  > **Note:**
  > - ∆ indicates the performance improvement (∆) of our model compared to its baseline.
 
  # Inference
+ Below is a code snippet demonstrating how to run grounding with our model. Given an image and an instruction, the model outputs normalized coordinates in the range [0, 1000].
 
  ```python
  from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
  import re
+ from PIL import Image, ImageDraw
  import requests
  from io import BytesIO
 
      except:
          return 0,0
 
+ def visualize_prediction(img, pred_x, pred_y, img_width, img_height):
      """
+     Visualize the predicted coordinates on the image.
      Args:
+         img: PIL.Image.Image
+         pred_x: float
+         pred_y: float
+         img_width: int
+         img_height: int
      """
+     pred_x = int((pred_x * img_width) / 1000)
+     pred_y = int((pred_y * img_height) / 1000)
+
+     draw = ImageDraw.Draw(img)
+
+     r = 20
+     draw.ellipse((pred_x - r, pred_y - r, pred_x + r, pred_y + r), outline="green", width=2)
+     cross_len = 6
+     draw.line((pred_x - cross_len, pred_y, pred_x + cross_len, pred_y), fill="green", width=2)
+     draw.line((pred_x, pred_y - cross_len, pred_x, pred_y + cross_len), fill="green", width=2)
+
+     img.save("predicted_coordinates.png")
+     print(f"Predicted coordinates: ({pred_x}, {pred_y})")
 
  # Load the model and processor
+ MODEL_PATH = "mlfoundations-cua-dev/Gelato-30B-A3B"
 
  model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
      MODEL_PATH,
  )
 
  processor = AutoProcessor.from_pretrained(
+     MODEL_PATH,
+     max_pixels=10**7  # 10 MP
  )
 
+ url = "https://github.com/QwenLM/Qwen3-VL/raw/main/cookbooks/assets/computer_use/computer_use1.jpeg"
+ response = requests.get(url)
+ print(response.status_code)
+ print(response.headers.get("Content-Type"))
+ img = Image.open(BytesIO(response.content))
+ img_width, img_height = img.size
+
  # Prepare messages
+ PROMPT = '''
  You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
 
  Output the coordinate pair exactly:
  (x,y)
  '''
+ PROMPT = PROMPT.strip().format(height=img_height, width=img_width)  # fill the resolution placeholders in the prompt
 
 
  messages = [
      {
          "role": "user",
          "content": [
+             {"type": "text", "text": PROMPT},
              {
                  "type": "image",
+                 "image": img,
              },
+             {"type": "text", "text": "Reload the cache."},
          ],
      }
  ]
 
+ device = next(model.parameters()).device
  inputs = processor.apply_chat_template(
      messages,
      tokenize=True,
      add_generation_prompt=True,
      return_dict=True,
      return_tensors="pt"
+ ).to(device)
 
  # Inference: Generation of the output
  generated_ids = model.generate(**inputs, max_new_tokens=128)
  )
 
  # Extract the coordinates from the output text
+ print(f"Model output: {output_text[0]}")
  pred_x, pred_y = extract_coordinates(output_text[0])
+
+ # Convert the normalized coordinates to absolute pixels and visualize the prediction
+ visualize_prediction(img, pred_x, pred_y, img_width, img_height)
+ ```
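The snippet above relies on `extract_coordinates`, whose body is collapsed in this diff, and on the model's normalized [0, 1000] output convention. A minimal sketch of a compatible parser and of the normalized-to-pixel conversion might look like the following; the helper names `parse_point` and `to_absolute` are illustrative and not part of the model card:

```python
import re


def parse_point(raw: str) -> tuple[int, int]:
    # Illustrative parser for the "(x,y)" format requested by the prompt;
    # the README's own extract_coordinates is collapsed in the diff above.
    match = re.search(r"\((\d+)\s*,\s*(\d+)\)", raw)
    if match is None:
        return 0, 0
    return int(match.group(1)), int(match.group(2))


def to_absolute(x_norm: int, y_norm: int, img_width: int, img_height: int) -> tuple[int, int]:
    # Gelato outputs coordinates normalized to [0, 1000]; scale them back to
    # pixel coordinates of the original screenshot (same math as visualize_prediction).
    return int(x_norm * img_width / 1000), int(y_norm * img_height / 1000)


# Example: a raw model output and a 1920x1080 screenshot.
x_norm, y_norm = parse_point("(512, 233)")
print(to_absolute(x_norm, y_norm, 1920, 1080))  # -> (983, 251)
```

The same conversion is what `visualize_prediction` applies before drawing the marker on the saved image.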