GUIrilla committed on
Commit 0b4bb7e · verified · 1 Parent(s): 0d95566

Update README.md

Files changed (1)
  1. README.md +97 -2
README.md CHANGED
@@ -6,8 +6,103 @@ tags:
  license: mit
  base_model:
  - microsoft/Florence-2-large
+ datasets:
+ - GUIrilla/GUIrilla-Task
  ---

- # Model Card for GUIrilla-See-0.7B
+ # GUIrilla-See-0.7B

- This model is a fine-tuned version of [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large).
+ *Lightweight vision–language model for GUI element localisation*
+
+ ---
+
+ ## Summary
+
+ **GUIrilla-See-0.7B** is a 0.7-billion-parameter model derived from **Florence-2-large** and fine-tuned for **open-vocabulary detection** in graphical user-interface (GUI) screenshots.
+ Given an image and a free-form textual description, the model returns either
+
+ * the bounding box of the best-matching element, or
+ * a polygon mask, when a bounding box is unavailable.
+
+ The model is intended for research on lightweight GUI agents, automated testing, and accessibility tools, where a small footprint is preferred over a larger model.
+
+ ---
+
+ ## Quick-start
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ # --- load pipeline -----------------------------------------------------------
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model_name = "GUIrilla/GUIrilla-See-0.7B"  # 0.7 B weights
+ dtype = torch.bfloat16 if device == "cuda" else torch.float32
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name, torch_dtype=dtype, trust_remote_code=True
+ ).to(device)
+
+ processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
+
+ # --- inference ---------------------------------------------------------------
+ image = Image.open("screenshot.png").convert("RGB")
+ task_prompt = "<OPEN_VOCABULARY_DETECTION>"
+ text_query = "button with the label “Submit”"
+
+ # The task token and the free-form query are concatenated into one prompt.
+ prompt = task_prompt + text_query
+ inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, dtype)
+
+ with torch.no_grad():
+     ids = model.generate(
+         input_ids=inputs["input_ids"],
+         pixel_values=inputs["pixel_values"],
+         max_new_tokens=1024,
+         num_beams=3,
+         do_sample=False,
+         early_stopping=False,
+     )
+
+ # Keep special tokens: the post-processor parses location tokens from the text.
+ decoded = processor.batch_decode(ids, skip_special_tokens=False)[0]
+ result = processor.post_process_generation(
+     decoded, task=task_prompt, image_size=image.size
+ )[task_prompt]
+ ```
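
The `result` returned by the Quick-start snippet can be reduced to a single click target. The sketch below assumes the dictionary layout that Florence-2's post-processing is understood to produce for this task: `bboxes` as `[x1, y1, x2, y2]` lists and `polygons` as nested lists of flat `[x1, y1, x2, y2, ...]` coordinates. `pick_click_point` is a hypothetical helper, not part of the model's API, so check the layout against your own output before relying on it.

```python
from typing import Iterable, Optional, Tuple


def _flatten(values: Iterable) -> list:
    """Flatten arbitrarily nested lists of numbers into one flat list."""
    flat = []
    for v in values:
        if isinstance(v, (list, tuple)):
            flat.extend(_flatten(v))
        else:
            flat.append(float(v))
    return flat


def pick_click_point(result: dict) -> Optional[Tuple[float, float]]:
    """Return an (x, y) point near the centre of the first prediction, or None.

    Assumed layout (hypothetical, verify against your own output):
      result["bboxes"]   -> [[x1, y1, x2, y2], ...]
      result["polygons"] -> nested lists of flat [x1, y1, x2, y2, ...] coordinates
    """
    bboxes = result.get("bboxes") or []
    if bboxes:
        x1, y1, x2, y2 = bboxes[0]
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    polygons = result.get("polygons") or []
    if polygons:
        coords = _flatten(polygons[0])
        xs, ys = coords[0::2], coords[1::2]
        if xs and ys:
            # Mean of the vertices: a simple stand-in for the true centroid.
            return (sum(xs) / len(xs), sum(ys) / len(ys))

    return None
```

The bounding box is tried first and the polygon is used only as a fallback, mirroring the order in which the Summary describes the two output types.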
+
+ ---
+
+ ## Training Data
+
+ Trained on [GUIrilla-Task](https://huggingface.co/datasets/GUIrilla/GUIrilla-Task).
+
+ * **Train data:** 25,606 tasks across 881 macOS applications (10% of these applications are held out for validation)
+ * **Test data:** 1,565 tasks across 227 macOS applications
+
+ ---
+
+ ## Training Procedure
+
+ * 4 epochs of LoRA fine-tuning on 1 × A100 40 GB.
+ * Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), LR = 5e-6 with a warm-up ratio of 0.01.
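
To make the warm-up ratio concrete, the sketch below converts it into a step count and evaluates the learning rate at a given step. It assumes a linear warm-up followed by a constant rate. The README does not state the schedule shape, the batch size, or the total number of optimiser steps, so any `total_steps` value passed in is illustrative only.

```python
def warmup_steps(total_steps: int, warmup_ratio: float = 0.01) -> int:
    """Number of optimiser steps spent warming up (rounded down)."""
    return int(total_steps * warmup_ratio)


def lr_at(step: int, total_steps: int, base_lr: float = 5e-6,
          warmup_ratio: float = 0.01) -> float:
    """Learning rate at a given step.

    Assumption: linear warm-up from 0 to base_lr, then constant.
    Any decay after warm-up is not specified in the README and is omitted.
    """
    n_warm = warmup_steps(total_steps, warmup_ratio)
    if n_warm > 0 and step < n_warm:
        return base_lr * step / n_warm
    return base_lr
```

For a hypothetical run of 1,000 steps, `warmup_steps(1000)` gives 10 warm-up steps, after which `lr_at` stays at the base rate of 5e-6.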
+
+ ---
+
+ ## Evaluation
+
+ | Split | Success rate (%) |
+ | ----- | ---------------- |
+ | Test  | **53.55**        |
+
+ ---
+
+ ## Ethical & Safety Notes
+
+ * Always sandbox or use confirmation steps when connecting the model to real GUIs.
+ * Screenshots may reveal sensitive data – ensure compliance with privacy regulations.
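
One way to add a confirmation step is to route every predicted action through a gate that must approve it before anything touches the real GUI. The sketch below is a minimal illustration. `confirm_and_click`, `click_fn`, `confirm_fn`, and `ask_on_terminal` are hypothetical names, and the actual click backend is left to the caller.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]


def confirm_and_click(
    point: Point,
    description: str,
    click_fn: Callable[[float, float], None],
    confirm_fn: Callable[[str], bool],
) -> bool:
    """Click only if confirm_fn approves; return whether the click happened."""
    x, y = point
    message = f"Click '{description}' at ({x:.0f}, {y:.0f})?"
    if not confirm_fn(message):
        return False
    click_fn(x, y)
    return True


def ask_on_terminal(message: str) -> bool:
    """Interactive confirmation: only an explicit 'y' counts as approval."""
    return input(f"{message} [y/N] ").strip().lower() == "y"
```

Passing the confirmation and click functions in as arguments keeps the gate independent of any particular automation library and makes it easy to substitute a no-op click function inside a sandbox.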
+
+ ---
+
+ ## License
+
+ MIT (see `LICENSE`).