GUIrilla/GUIrilla-See-0.7B

GUIrilla committed on May 16

Commit

0b4bb7e

1 Parent(s): 0d95566

Update README.md

Files changed (1)

README.md +97 -2

@@ -6,8 +6,103 @@ tags:
 license: mit
 base_model:
 - microsoft/Florence-2-large
 ---
-# Model Card for GUIrilla-See-0.7B
-This model is a fine-tuned version of [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large).

 license: mit
 base_model:
 - microsoft/Florence-2-large
+datasets:
+- GUIrilla/GUIrilla-Task
 ---
+# GUIrilla-See-0.7B
+*Lightweight vision–language model for GUI element localisation*
+---
+## Summary
+**GUIrilla-See-0.7B** is a 0.7-billion-parameter model derived from **Florence-2-large** and fine-tuned for **open-vocabulary detection** in graphical user interface (GUI) screenshots.
+Given an image and a free-form textual description, the model returns either
+* the bounding box of the best-matching element, or
+* a polygon mask, when a bounding box is unavailable.
+The model is intended for research on lightweight GUI agents, automated testing, and accessibility tools, where a small footprint is preferred over larger models.
+---
+## Quick-start
+```python
+import torch
+from PIL import Image
+from transformers import AutoModelForCausalLM, AutoProcessor
+# --- load pipeline -----------------------------------------------------------
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model_name = "GUIrilla/GUIrilla-See-0.7B"        # 0.7 B weights
+dtype = torch.bfloat16 if device == "cuda" else torch.float32
+model = AutoModelForCausalLM.from_pretrained(
+    model_name, torch_dtype=dtype, trust_remote_code=True
+).to(device)
+processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
+# --- inference ---------------------------------------------------------------
+image = Image.open("screenshot.png").convert("RGB")
+task_prompt = "<OPEN_VOCABULARY_DETECTION>"
+text_query  = "button with the label “Submit”"
+prompt = task_prompt + text_query
+inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, dtype)
+with torch.no_grad():
+    ids = model.generate(
+        input_ids   = inputs["input_ids"],
+        pixel_values= inputs["pixel_values"],
+        max_new_tokens = 1024,
+        num_beams      = 3,
+        do_sample      = False,
+        early_stopping = False,
+    )
+decoded = processor.batch_decode(ids, skip_special_tokens=False)[0]
+result  = processor.post_process_generation(
+    decoded, task=task_prompt, image_size=image.size
+)["<OPEN_VOCABULARY_DETECTION>"]
+```
+---
+## Training Data
+Trained on [GUIrilla-Task](https://huggingface.co/datasets/GUIrilla/GUIrilla-Task).
+* **Train data:** 25,606 tasks across 881 macOS applications (10% of the applications are held out for validation)
+* **Test data:** 1,565 tasks across 227 macOS applications
+---
+## Training Procedure
+* LoRA fine-tuning for 4 epochs on a single A100 (40 GB).
+* Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), learning rate 5e-6 with a warm-up ratio of 0.01.
+---
+## Evaluation
+| Split | Success Rate % |
+| ----- | ---------------|
+| Test  | **53.55**      |
+---
+## Ethical & Safety Notes
+* Always sandbox the model or require confirmation steps when connecting it to real GUIs.
+* Screenshots may reveal sensitive data – ensure compliance with privacy regulations.
+---
+## License
+MIT (see `LICENSE`).
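
Note on the Quick-start in this commit: the snippet stops at `result` without showing how to use it. The sketch below is not part of the commit; it shows one way to turn that dictionary into a single click point. It assumes `result` carries `bboxes` (each `[x1, y1, x2, y2]`) and `polygons` (flat or nested `x, y` coordinate lists), as the upstream Florence-2 post-processor does for this task; those key names and the helper `click_point` are assumptions, so inspect the real output before relying on them.

```python
from typing import Iterator, Optional, Tuple


def _flatten(values) -> Iterator[float]:
    """Yield numbers from an arbitrarily nested list of coordinates."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from _flatten(value)
        else:
            yield float(value)


def click_point(result: dict) -> Optional[Tuple[float, float]]:
    """Return the centre (x, y) of the first predicted element, or None.

    Hypothetical helper: the keys "bboxes" and "polygons" are assumed,
    not confirmed by the model card.
    """
    bboxes = result.get("bboxes", [])
    if len(bboxes) > 0:
        x1, y1, x2, y2 = (float(v) for v in bboxes[0])
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    polygons = result.get("polygons", [])
    if len(polygons) > 0:
        coords = list(_flatten(polygons[0]))
        xs, ys = coords[0::2], coords[1::2]
        if xs and ys:
            # Fall back to the centre of the polygon's bounding box.
            return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

    return None
```

With the Quick-start variables in scope, `click_point(result)` would give pixel coordinates in the original screenshot, since `post_process_generation` is called with `image_size=image.size`. Taking the bounding-box centre is a simple choice for illustration; the model card does not say how predictions are scored.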