TianheWu committed · Commit 6e5e7d8 · verified · 1 Parent(s): 5a9b35c

Update README.md

Files changed (1)
  1. README.md +319 -49
README.md CHANGED
@@ -1,19 +1,19 @@
1
- ---
2
- license: mit
3
- language:
4
- - en
5
- base_model:
6
- - Qwen/Qwen2.5-VL-7B-Instruct
7
- pipeline_tag: reinforcement-learning
8
- tags:
9
- - IQA
10
- - Reasoning
11
- - VLM
12
- - Pytorch
13
- - R1
14
- - GRPO
15
- - RL2R
16
- ---
17
 
18
  # VisualQuality-R1-7B
19
  This is the latest version of VisualQuality-R1, trained on a diverse combination of synthetic and realistic datasets.<br>
@@ -26,12 +26,25 @@ Code link: [github](https://github.com/TianheWu/VisualQuality-R1)
26
  <img src="https://cdn-uploads.huggingface.co/production/uploads/655de51982afda0fc479fb91/JZgVeMtAVASCCNYO5VCyn.png" width="600"/>
27
 
28
 
29
- ## Quick Start
30
- This section includes the usages of **VisualQuality-R1**.
31
 
32
  <details>
33
- <summary>Example Code (Single Image Quality Rating)</summary>
34
-
35
  ```python
36
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
37
  from qwen_vl_utils import process_vision_info
@@ -50,8 +63,7 @@ def score_image(image_path, model, processor):
50
  "First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
51
  )
52
 
53
- QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
54
- # QUESTION_TEMPLATE = "Please describe the quality of this image."
55
  message = [
56
  {
57
  "role": "user",
@@ -85,8 +97,7 @@ def score_image(image_path, model, processor):
85
  generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
86
  )
87
 
88
- reasoning = re.findall(r'<think>(.*?)</think>', batch_output_text[0], re.DOTALL)
89
- reasoning = reasoning[-1].strip()
90
 
91
  try:
92
  model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
@@ -117,14 +128,13 @@ reasoning, score = score_image(
117
  image_path, model, processor
118
  )
119
 
120
- print(reasoning)
121
  print(score)
122
  ```
123
  </details>
124
 
125
 
126
  <details>
127
- <summary>Example Code (Batch Images Quality Rating)</summary>
128
 
129
  ```python
130
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
@@ -156,7 +166,7 @@ def score_batch_image(image_paths, model, processor):
156
  "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
157
  )
158
 
159
- QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
160
 
161
  messages = []
162
  for img_path in image_paths:
@@ -202,9 +212,6 @@ def score_batch_image(image_paths, model, processor):
202
 
203
  path_score_dict = {}
204
  for img_path, model_output in zip(image_paths, all_outputs):
205
- reasoning = re.findall(r'<think>(.*?)</think>', model_output, re.DOTALL)
206
- reasoning = reasoning[-1].strip()
207
-
208
  try:
209
  model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
210
  model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
@@ -247,11 +254,11 @@ print("Done!")
247
  ```
248
  </details>
249
 
 
250
 
251
  <details>
252
- <summary>Example Code (Images Inference)</summary>
253
-
254
- You can use any prompt you like in the following code (multi-image input is supported).
255
  ```python
256
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
257
  from qwen_vl_utils import process_vision_info
@@ -262,13 +269,22 @@ import re
262
  import os
263
 
264
 
265
- def generate(image_paths, model, prompt, processor):
266
  message = [
267
  {
268
  "role": "user",
269
  "content": [
270
- *({'type': 'image', 'image': img_path} for img_path in image_paths),
271
- {"type": "text", "text": prompt}
272
  ],
273
  }
274
  ]
@@ -296,16 +312,142 @@ def generate(image_paths, model, prompt, processor):
296
  generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
297
  )
298
 
299
- return batch_output_text[0]
300
 
301
 
302
  random.seed(1)
303
  MODEL_PATH = ""
304
  device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
305
- image_path = [
306
- "",
307
- ""
308
- ]
309
 
310
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
311
  MODEL_PATH,
@@ -316,24 +458,152 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
316
  processor = AutoProcessor.from_pretrained(MODEL_PATH)
317
  processor.tokenizer.padding_side = "left"
318
 
319
- prompt = "Please describe the quality of given two images."
320
- answer = generate(
321
- image_path, model, prompt, processor
322
  )
323
 
324
- print(answer)
325
  ```
326
  </details>
327
 
328
 
329
 
330
- ## Related Projects
331
- - [ECCV 2024] [A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment](https://arxiv.org/abs/2403.10854v2)
332
- - [CVPR 2025] [Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption](https://www.arxiv.org/abs/2503.11221)
333
 
334
 
335
  ## 📧 Contact
336
- If you have any questions, please email `wth22@mails.tsinghua.edu.cn` or `tianhewu@cityu.edu.hk`.
337
 
338
 
339
  ## BibTeX
 
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ base_model:
6
+ - Qwen/Qwen2.5-VL-7B-Instruct
7
+ pipeline_tag: reinforcement-learning
8
+ tags:
9
+ - IQA
10
+ - Reasoning
11
+ - VLM
12
+ - Pytorch
13
+ - R1
14
+ - GRPO
15
+ - RL2R
16
+ ---
17
 
18
  # VisualQuality-R1-7B
19
  This is the latest version of VisualQuality-R1, trained on a diverse combination of synthetic and realistic datasets.<br>
 
26
  <img src="https://cdn-uploads.huggingface.co/production/uploads/655de51982afda0fc479fb91/JZgVeMtAVASCCNYO5VCyn.png" width="600"/>
27
 
28
 
29
+ ## Quick Start
30
+
31
+ ### Non-Thinking Inference
32
+ When you run inference with VisualQuality-R1 as a reward/evaluation model, you can use **non-thinking** mode to reduce inference time, so that the model generates only the final score rather than a reasoning trace, using the following prompt:
33
+ ```python
34
+ PROMPT = (
35
+ "You are doing the image quality assessment task. Here is the question: "
36
+ "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
37
+ "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
38
+ )
39
+
40
+ QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
41
+ ```
42
+
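As a minimal sketch of how the non-thinking prompt above might be assembled and how a reply could be parsed, the snippet below reuses `PROMPT` and `QUESTION_TEMPLATE` from the README. The helper names `build_non_thinking_prompt` and `parse_score` are hypothetical and not part of the repository, and returning `None` when no number is found is an assumption (the README scripts fall back to a random score instead).

```python
import re

PROMPT = (
    "You are doing the image quality assessment task. Here is the question: "
    "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
    "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
)

QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."


def build_non_thinking_prompt():
    """Hypothetical helper: fill the template with the rating question."""
    return QUESTION_TEMPLATE.format(Question=PROMPT)


def parse_score(model_output):
    """Hypothetical helper: return the score as a float, or None if no number is found."""
    # Prefer the content of the last <answer> tag; otherwise fall back to the raw text.
    matches = re.findall(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    answer = matches[-1].strip() if matches else model_output.strip()
    number = re.search(r"\d+(\.\d+)?", answer)
    return float(number.group()) if number else None


if __name__ == "__main__":
    print(build_non_thinking_prompt())
    print(parse_score("<answer>3.45</answer>"))
```

Returning `None` keeps parse failures visible to the caller, which may be preferable when the scores feed a reward signal.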
43
+ For single image quality rating, the code is:
44
 
45
  <details>
46
+ <summary>Example Code (VisualQuality-R1: Image Quality Rating with non-thinking mode)</summary>
47
+
48
  ```python
49
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
50
  from qwen_vl_utils import process_vision_info
 
63
  "First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
64
  )
65
 
66
+ QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
 
67
  message = [
68
  {
69
  "role": "user",
 
97
  generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
98
  )
99
 
100
+ reasoning = None
 
101
 
102
  try:
103
  model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
 
128
  image_path, model, processor
129
  )
130
 
 
131
  print(score)
132
  ```
133
  </details>
134
 
135
 
136
  <details>
137
+ <summary>Example Code (VisualQuality-R1: Batch Images Quality Rating with non-thinking mode)</summary>
138
 
139
  ```python
140
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
 
166
  "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
167
  )
168
 
169
+ QUESTION_TEMPLATE = "{Question} Please only output the final answer with only one score in <answer> </answer> tags."
170
 
171
  messages = []
172
  for img_path in image_paths:
 
212
 
213
  path_score_dict = {}
214
  for img_path, model_output in zip(image_paths, all_outputs):
215
  try:
216
  model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
217
  model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
 
254
  ```
255
  </details>
256
 
257
+ ### Thinking mode for inference
258
 
259
  <details>
260
+ <summary>Example Code (VisualQuality-R1: Single Image Quality Rating with thinking)</summary>
261
+
 
262
  ```python
263
  from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
264
  from qwen_vl_utils import process_vision_info
 
269
  import os
270
 
271
 
272
+ def score_image(image_path, model, processor):
273
+ PROMPT = (
274
+ "You are doing the image quality assessment task. Here is the question: "
275
+ "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
276
+ "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality. "
277
+ "First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
278
+ )
279
+
280
+ QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
281
+ # QUESTION_TEMPLATE = "Please describe the quality of this image."
282
  message = [
283
  {
284
  "role": "user",
285
  "content": [
286
+ {'type': 'image', 'image': image_path},
287
+ {"type": "text", "text": PROMPT}
288
  ],
289
  }
290
  ]
 
312
  generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
313
  )
314
 
315
+ reasoning = re.findall(r'<think>(.*?)</think>', batch_output_text[0], re.DOTALL)
316
+ reasoning = reasoning[-1].strip() if reasoning else ""
317
+
318
+ try:
319
+ model_output_matches = re.findall(r'<answer>(.*?)</answer>', batch_output_text[0], re.DOTALL)
320
+ model_answer = model_output_matches[-1].strip() if model_output_matches else batch_output_text[0].strip()
321
+ score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
322
+ except Exception:
323
+ print(f"================= Meet error with {image_path}, please generate again. =================")
324
+ score = random.randint(1, 5)
325
+
326
+ return reasoning, score
327
 
328
 
329
  random.seed(1)
330
  MODEL_PATH = ""
331
  device = torch.device("cuda:5") if torch.cuda.is_available() else torch.device("cpu")
332
+ image_path = ""
333
+
334
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
335
+ MODEL_PATH,
336
+ torch_dtype=torch.bfloat16,
337
+ attn_implementation="flash_attention_2",
338
+ device_map=device,
339
+ )
340
+ processor = AutoProcessor.from_pretrained(MODEL_PATH)
341
+ processor.tokenizer.padding_side = "left"
342
+
343
+ reasoning, score = score_image(
344
+ image_path, model, processor
345
+ )
346
+
347
+ print(reasoning)
348
+ print(score)
349
+ ```
350
+ </details>
351
+
352
+
353
+ <details>
354
+ <summary>Example Code (VisualQuality-R1: Batch Images Quality Rating with thinking)</summary>
355
+
356
+ ```python
357
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
358
+ from qwen_vl_utils import process_vision_info
359
+ from tqdm import tqdm
360
+
361
+ import torch
362
+ import random
363
+ import re
364
+ import os
365
+
366
+
367
+ def get_image_paths(folder_path):
368
+ image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
369
+ image_paths = []
370
+
371
+ for root, dirs, files in os.walk(folder_path):
372
+ for file in files:
373
+ _, ext = os.path.splitext(file)
374
+ if ext.lower() in image_extensions:
375
+ image_paths.append(os.path.join(root, file))
376
+
377
+ return image_paths
378
+
379
+ def score_batch_image(image_paths, model, processor):
380
+ PROMPT = (
381
+ "You are doing the image quality assessment task. Here is the question: "
382
+ "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
383
+ "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
384
+ )
385
+
386
+ QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
387
+
388
+ messages = []
389
+ for img_path in image_paths:
390
+ message = [
391
+ {
392
+ "role": "user",
393
+ "content": [
394
+ {'type': 'image', 'image': img_path},
395
+ {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
396
+ ],
397
+ }
398
+ ]
399
+ messages.append(message)
400
+
401
+ BSZ = 32
402
+ all_outputs = [] # List to store all answers
403
+ for i in tqdm(range(0, len(messages), BSZ)):
404
+ batch_messages = messages[i:i + BSZ]
405
+
406
+ # Preparation for inference
407
+ text = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in batch_messages]
408
+
409
+ image_inputs, video_inputs = process_vision_info(batch_messages)
410
+ inputs = processor(
411
+ text=text,
412
+ images=image_inputs,
413
+ videos=video_inputs,
414
+ padding=True,
415
+ return_tensors="pt",
416
+ )
417
+ inputs = inputs.to(device)
418
+
419
+ # Inference: Generation of the output
420
+ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=512, do_sample=True, top_k=50, top_p=1)
421
+ generated_ids_trimmed = [
422
+ out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
423
+ ]
424
+ batch_output_text = processor.batch_decode(
425
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
426
+ )
427
+
428
+ all_outputs.extend(batch_output_text)
429
+
430
+ path_score_dict = {}
431
+ for img_path, model_output in zip(image_paths, all_outputs):
432
+ reasoning = re.findall(r'<think>(.*?)</think>', model_output, re.DOTALL)
433
+ reasoning = reasoning[-1].strip() if reasoning else ""
434
+
435
+ try:
436
+ model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
437
+ model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
438
+ score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
439
+ except Exception:
440
+ print(f"Meet error with {img_path}, please generate again.")
441
+ score = random.randint(1, 5)
442
+
443
+ path_score_dict[img_path] = score
444
+
445
+ return path_score_dict
446
+
447
+
448
+ random.seed(1)
449
+ MODEL_PATH = ""
450
+ device = torch.device("cuda:3") if torch.cuda.is_available() else torch.device("cpu")
451
 
452
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
453
  MODEL_PATH,
 
458
  processor = AutoProcessor.from_pretrained(MODEL_PATH)
459
  processor.tokenizer.padding_side = "left"
460
 
461
+ image_root = ""
462
+ image_paths = get_image_paths(image_root) # It should be a list
463
+
464
+ path_score_dict = score_batch_image(
465
+ image_paths, model, processor
466
  )
467
 
468
+ file_name = "output.txt"
469
+ with open(file_name, "w") as file:
470
+ for key, value in path_score_dict.items():
471
+ file.write(f"{key} {value}\n")
472
+
473
+ print("Done!")
474
  ```
475
  </details>
476
 
477
 
478
+ ## 🚀 Updated: VisualQuality-R1 high efficiency inference script with vLLM
479
+
480
+ <details>
481
+ <summary>Example Code (VisualQuality-R1: Batch Images Quality Rating with thinking, using vLLM)</summary>
482
+
483
+ ```python
484
+ # Please install vLLM first: https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html
485
+
486
+ from transformers import Qwen2_5_VLProcessor, AutoProcessor
487
+ from vllm import LLM, RequestOutput, SamplingParams
488
+ from qwen_vl_utils import process_vision_info
489
+
490
+ import torch
491
+ import random
492
+ import re
493
+ import os
494
+
495
+ IMAGE_PATH = "./images"
496
+ MODEL_PATH = "TianheWu/VisualQuality-R1-7B"
497
+
498
+ def get_image_paths(folder_path):
499
+ image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
500
+ image_paths = []
501
+
502
+ for root, dirs, files in os.walk(folder_path):
503
+ for file in files:
504
+ _, ext = os.path.splitext(file)
505
+ if ext.lower() in image_extensions:
506
+ image_paths.append(os.path.join(root, file))
507
+
508
+ return image_paths
509
+
510
+ def score_batch_image(image_paths, model: LLM, processor: Qwen2_5_VLProcessor):
511
+ PROMPT = (
512
+ "You are doing the image quality assessment task. Here is the question: "
513
+ "What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, "
514
+ "rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality."
515
+ )
516
+
517
+ QUESTION_TEMPLATE = "{Question} First output the thinking process in <think> </think> tags and then output the final answer with only one score in <answer> </answer> tags."
518
+
519
+ messages = []
520
+ for img_path in image_paths:
521
+ message = [
522
+ {
523
+ "role": "user",
524
+ "content": [
525
+ {'type': 'image', 'image': img_path},
526
+ {"type": "text", "text": QUESTION_TEMPLATE.format(Question=PROMPT)}
527
+ ],
528
+ }
529
+ ]
530
+ messages.append(message)
531
+
532
+ all_outputs = [] # List to store all answers
533
+
534
+ # Preparation for inference
535
+ print("preprocessing ...")
536
+ texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True, add_vision_id=True) for msg in messages]
537
+ image_inputs, video_inputs = process_vision_info(messages)
538
+
539
+ inputs = [{
540
+ "prompt": texts[i],
541
+ "multi_modal_data": {
542
+ "image": image_inputs[i]
543
+ },
544
+ } for i in range(len(messages))]
545
+
546
+ output: list[RequestOutput] = model.generate(
547
+ inputs,
548
+ sampling_params=SamplingParams(
549
+ max_tokens=512,
550
+ temperature=0.1,
551
+ top_k=50,
552
+ top_p=1.0,
553
+ stop_token_ids=[processor.tokenizer.eos_token_id],
554
+ ),
555
+ )
556
+
557
+ batch_output_text = [o.outputs[0].text for o in output]
558
+
559
+ all_outputs.extend(batch_output_text)
560
+
561
+ path_score_dict = {}
562
+ for img_path, model_output in zip(image_paths, all_outputs):
563
+ print(f"{model_output = }")
564
+ try:
565
+ model_output_matches = re.findall(r'<answer>(.*?)</answer>', model_output, re.DOTALL)
566
+ model_answer = model_output_matches[-1].strip() if model_output_matches else model_output.strip()
567
+ score = float(re.search(r'\d+(\.\d+)?', model_answer).group())
568
+ except Exception:
569
+ print(f"Meet error with {img_path}, please generate again.")
570
+ score = random.randint(1, 5)
571
+
572
+ path_score_dict[img_path] = score
573
+
574
+ return path_score_dict
575
+
576
+
577
+ random.seed(1)
578
+ model = LLM(
579
+ model=MODEL_PATH,
580
+ tensor_parallel_size=1,
581
+ trust_remote_code=True,
582
+ seed=1,
583
+ )
584
+
585
+ processor = AutoProcessor.from_pretrained(MODEL_PATH)
586
+ processor.tokenizer.padding_side = "left"
587
+
588
+ image_paths = get_image_paths(IMAGE_PATH) # It should be a list
589
+
590
+ path_score_dict = score_batch_image(
591
+ image_paths, model, processor
592
+ )
593
+
594
+ file_name = "output.txt"
595
+ with open(file_name, "w") as file:
596
+ for key, value in path_score_dict.items():
597
+ file.write(f"{key} {value}\n")
598
+
599
+ print("Done!")
600
+ ```
601
+ </details>
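The batch scripts above write one `path score` pair per line to `output.txt`. A minimal sketch of reading that file back and ranking images by score might look like the following. The helpers `load_scores` and `rank_by_score` are hypothetical and not part of the repository; splitting on the last space is an assumption made so that image paths containing spaces still parse.

```python
def load_scores(path="output.txt"):
    """Hypothetical helper: read 'image_path score' lines into a dict."""
    scores = {}
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the LAST space so paths that contain spaces stay intact.
            image_path, _, score = line.rpartition(" ")
            scores[image_path] = float(score)
    return scores


def rank_by_score(scores):
    """Hypothetical helper: return (path, score) pairs, best quality first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for image_path, score in rank_by_score(load_scores("output.txt")):
        print(f"{score:.2f}  {image_path}")
```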
602
 
603
 
604
 
605
  ## 📧 Contact
606
+ If you have any questions, please email `wth22@mails.tsinghua.edu.cn` or `tianhewu-c@my.cityu.edu.hk`.
607
 
608
 
609
  ## BibTeX