---
language:
  - en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - multimodal-llm
  - surgical
  - healthcare
base_model: Qwen/Qwen2.5-VL-7B
base_model_relation: finetune
license: other
license_name: nsclv1
license_link: https://huggingface.co/nvidia/Qwen2.5-VL-7B-Surg-CholecT50/resolve/main/License.docx
datasets:
  - CAMMA-public/cholect50
model-index:
  - name: Qwen2.5-VL-7B-Surg-CholecT50
    results:
      - task:
          type: image-text-to-text
          name: Surgical Triplet Recognition
        dataset:
          name: CholecT50
          type: cholect50
        metrics:
          - type: f1
            name: F1 Instrument
            value: 0.81
          - type: f1
            name: F1 Verb
            value: 0.64
          - type: f1
            name: F1 Target
            value: 0.60
---

# Model Overview

### Description:
Qwen2.5-VL-7B-Surg-CholecT50 is a multimodal large language model fine-tuned on the CholecT50 dataset of laparoscopic cholecystectomy procedures to recognize and describe surgical actions, instruments, and targets in endoscopic video frames. It was developed by NVIDIA for research in surgical workflow analysis and fine-grained action recognition.<br>

This model is for research and development only.  <br>

### License/Terms of Use

Please see the [NSCLv1 license](./License.docx). <br> 

### Deployment Geography:
Global <br>

### Use Case:
Primarily intended for surgical researchers, healthcare AI developers, or academic institutions exploring laparoscopic action recognition and surgical workflow analytics. <br>

## Reference(s):
Twinanda, A. P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., & Padoy, N. (2016). [EndoNet: A deep architecture for recognition tasks on laparoscopic videos.](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7519080)  <br>
C.I. Nwoye, N. Padoy. Data Splits and Metrics for Benchmarking Methods on Surgical Action Triplet Datasets. arXiv:2204.05235.  <br> 

## Model Architecture:
**Architecture Type:** Transformer-based Large Language Model with a Vision Adapter <br>

**Network Architecture:** Qwen2.5-VL-7B <br>

**This model was developed based on Qwen2.5-VL-7B.** <br>
**Number of model parameters:** ~7.0×10^9 <br>

## Input:
**Input Type(s):** Image (endoscopic frame), (Optional) Text Prompt <br>
**Input Format:** Red, Green, Blue (RGB), String <br>
**Input Parameters:** Image: Two-Dimensional (2D) laparoscopic image frames (extracted at 1 fps), Text: One-Dimensional (1D) <br>
**Other Properties Related to Input:** Recommended resolution: 480p or higher. Minimal resizing (e.g., 224×224) if required by the model’s vision encoder. Token limit for text context: up to ~4k tokens. <br>
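
Below is a minimal inference sketch using the `transformers` library (a recent version with Qwen2.5-VL support is assumed). The frame path, prompt wording, and generation settings are illustrative assumptions, not an official usage recipe; adjust them to your environment and prompt format.

```python
# Minimal inference sketch; frame path, prompt text, and generation
# settings below are assumptions for illustration.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "nvidia/Qwen2.5-VL-7B-Surg-CholecT50"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One laparoscopic frame (e.g., extracted at 1 fps from a procedure video).
frame = Image.open("frame_000123.png").convert("RGB")  # hypothetical file

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the instrument, action, and target in this frame."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[frame], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens before decoding the newly generated text.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```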

## Output:
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Returns natural language descriptions of recognized instruments, actions, and targets; no bounding boxes or segmentation maps by default. Downstream systems may parse the text output for analytics. NVIDIA GPUs can significantly reduce inference time. <br>
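
Because the model returns free-form text rather than structured labels, downstream analytics typically need a light parsing step. The sketch below assumes the response names each triplet component explicitly (e.g., `instrument: grasper, verb: retract, target: gallbladder`); the actual output phrasing may differ, so the pattern is illustrative only.

```python
# Hypothetical triplet parser: assumes the generated text names each
# component as "instrument: ...", "verb: ...", "target: ...".
import re
from typing import Optional

TRIPLET_PATTERN = re.compile(
    r"instrument:\s*(?P<instrument>[\w\- ]+).*?"
    r"verb:\s*(?P<verb>[\w\- ]+).*?"
    r"target:\s*(?P<target>[\w\- ]+)",
    flags=re.IGNORECASE | re.DOTALL,
)

def parse_triplet(response: str) -> Optional[dict]:
    """Extract an <instrument, verb, target> triplet from model output, if present."""
    match = TRIPLET_PATTERN.search(response)
    if match is None:
        return None
    return {k: v.strip() for k, v in match.groupdict().items()}

print(parse_triplet("instrument: grasper, verb: retract, target: gallbladder"))
# {'instrument': 'grasper', 'verb': 'retract', 'target': 'gallbladder'}
```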

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br> 

## Software Integration:
**Runtime Engine(s):** Any standard LLM-serving solution (e.g., PyTorch with Triton Inference Server) <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere (e.g., A100) <br>
* NVIDIA Hopper (e.g., H100) <br>

**Preferred/Supported Operating System(s):**
* Linux <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s):
v1.0 (Finetuned on CholecT50)  <br>

This model may be used with the [MONAI Surgical Agent Framework](https://github.com/Project-MONAI/VLM-Surgical-Agent-Framework).

## Training Dataset:
[CholecT50](https://github.com/CAMMA-public/cholect50) 


**Data Modality:** <br>
* Image and Text <br>

**Image Training Data Size:** <br>
* Less than a Million Images <br>

**Text Training Data Size:** <br>
* Less than a Billion Tokens <br>

**Data Collection Method by dataset:** <br>
* Hybrid: Automated, Human <br>

**Labeling Method by dataset:** <br>
* Human <br>

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~50 laparoscopic cholecystectomy procedures; frames extracted at 1 fps (~100K training frames); annotations include `<instrument, verb, target>` triplets. <br>

### Testing Dataset:

**Link:** CholecT50 (holdout portion)  <br>

**Data Collection Method by dataset:** <br>
* Hybrid: Automated, Human <br>

**Labeling Method by dataset:** <br>
* Human <br>

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** Approximately 1–2K frames held out for testing. <br>



### Evaluation Dataset:

**Link:** CholecT50 (dedicated set never seen during training)  <br>

**Benchmark Score:** <br>
F1-score (Triplets): Instrument: 0.81, Verb: 0.64, Target (Anatomy): 0.60 (a minimal scoring sketch is included at the end of this section) <br>

**Data Collection Method by dataset:** <br>
* Hybrid: Automated, Human <br>

**Labeling Method by dataset:** <br>
* Human <br>

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~1–2K frames for final evaluation. <br>
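
As a rough illustration of how component-wise F1 scores like those above can be computed once predicted and ground-truth triplets have been aligned per frame, here is a minimal sketch using scikit-learn. The example labels and the macro-averaging choice are assumptions; the official CholecT50 benchmark protocol should be used for numbers comparable to published results.

```python
# Assumed scoring sketch: per-component F1 over aligned frame-level labels.
from sklearn.metrics import f1_score

# Hypothetical per-frame labels (ground truth vs. parsed model predictions).
gt_instruments   = ["grasper", "hook", "grasper", "clipper"]
pred_instruments = ["grasper", "hook", "scissors", "clipper"]

gt_verbs   = ["retract", "dissect", "retract", "clip"]
pred_verbs = ["retract", "dissect", "grasp", "clip"]

# Macro-averaged F1 per component; the averaging mode is an assumption.
f1_instrument = f1_score(gt_instruments, pred_instruments, average="macro")
f1_verb = f1_score(gt_verbs, pred_verbs, average="macro")
print(f"F1 instrument: {f1_instrument:.2f}, F1 verb: {f1_verb:.2f}")
```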


## Inference:
**Acceleration Engine:** vLLM <br>
**Test Hardware:** A6000 <br>  
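
For accelerated inference, a hedged offline-inference sketch with vLLM follows. The prompt template with Qwen-style vision placeholder tokens and the multimodal input format are assumptions based on vLLM's general Qwen2/2.5-VL support; consult the vLLM documentation for the exact placeholders and supported versions.

```python
# Offline vLLM sketch (assumed: placeholder tokens and that the installed
# vLLM version supports Qwen2.5-VL image inputs).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Qwen2.5-VL-7B-Surg-CholecT50", max_model_len=4096)

frame = Image.open("frame_000123.png").convert("RGB")  # hypothetical file
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Describe the instrument, action, and target in this frame.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": frame}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```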


## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br> 

Please make sure you have the proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, any generated image or video will not blur or maintain the proportions of the subjects included.


Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).  <br>