mlfoundations/Gelato-30B-A3B

@@ -18,7 +18,7 @@ For ablation studies and additional insights, see our detailed [blog post]()!
 # Performance
-We evaluate on benchmarks ScreenSpot-V2, ScreenSpotPro and OS-World-G for grounding as well as an agentic benchmark OS-World. For the latter we use an evaluation harness combining our grounding model with a planner (GPT-5) inspired by GTA1 Test-Time Scaling GUI Agents.
+We evaluate on benchmarks ScreenSpot-V2, ScreenSpotPro and OS-World-G for grounding as well as an agentic benchmark OS-World. For the latter we use an [evaluation harness](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/gta1/gta1_agent.py) combining our grounding model with a planner (GPT-5):
 | **Model**         | **Size** | **Open Source** | **ScreenSpot-V2** | **ScreenSpotPro** | **OSWORLD-G** |
 |-------------------|:--------:|:---------------:|:-----------------:|:-----------------:|:-----------------:|