---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal-llm
- surgical
- healthcare
base_model: Qwen/Qwen2.5-VL-7B
base_model_relation: finetune
license: other
license_name: nsclv1
license_link: https://huggingface.co/nvidia/Qwen2.5-VL-7B-Surg-CholecT50/resolve/main/License.docx
datasets:
- CAMMA-public/cholect50
model-index:
- name: Qwen2.5-VL-7B-Surg-CholecT50
  results:
  - task:
      type: image-text-to-text
      name: Surgical Triplet Recognition
    dataset:
      name: CholecT50
      type: cholect50
    metrics:
    - type: f1
      name: F1 Instrument
      value: 0.81
    - type: f1
      name: F1 Verb
      value: 0.64
    - type: f1
      name: F1 Target
      value: 0.60
---
# Model Overview
### Description:
Qwen2.5-VL-7B-Surg-CholecT50 is a multimodal large language model fine-tuned on the CholecT50 dataset of laparoscopic cholecystectomy procedures to recognize and describe surgical actions, instruments, and targets in endoscopic video frames. It was developed by NVIDIA for research in surgical workflow analysis and fine-grained action recognition.<br>
This model is for research and development only. <br>
### License/Terms of Use
Please see the [NSCLv1 license](./License.docx). <br>
### Deployment Geography:
Global <br>
### Use Case:
Primarily intended for surgical researchers, healthcare AI developers, or academic institutions exploring laparoscopic action recognition and surgical workflow analytics. <br>
## Reference(s):
Twinanda, A. P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., & Padoy, N. (2016). [EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos.](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7519080) <br>
Nwoye, C. I., & Padoy, N. (2022). [Data Splits and Metrics for Benchmarking Methods on Surgical Action Triplet Datasets.](https://arxiv.org/abs/2204.05235) arXiv:2204.05235. <br>
## Model Architecture:
**Architecture Type:** Transformer-based Large Language Model with a Vision Adapter <br>
**Network Architecture:** Qwen2.5-VL-7B <br>
**This model was developed based on Qwen2.5-VL-7B** <br>
**Number of model parameters:** ~7.0×10^9 <br>
## Input: <br>
**Input Type(s):** Image (endoscopic frame), (Optional) Text Prompt <br>
**Input Format:** Red, Green, Blue (RGB), String <br>
**Input Parameters:** Image: Two-Dimensional (2D) laparoscopic image frames (extracted at 1 fps), Text: One-Dimensional (1D) <br>
**Other Properties Related to Input:** Recommended resolution: 480p or higher. Minimal resizing (e.g., 224×224) if required by the model’s vision encoder. Token limit for text context: up to ~4k tokens. <br>
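As a quick orientation, below is a minimal inference sketch using the Hugging Face `transformers` image-text-to-text API. It assumes a recent `transformers` release with Qwen2.5-VL support; the frame path and prompt wording are illustrative, not a prescribed template.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nvidia/Qwen2.5-VL-7B-Surg-CholecT50"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# One RGB laparoscopic frame; "frame_00042.png" is an illustrative path.
frame = Image.open("frame_00042.png").convert("RGB")

# Chat-style prompt; the exact wording is an example only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "List the instrument, verb, and target visible in this frame."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=frame, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```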
## Output: <br>
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Returns natural language descriptions of recognized instruments, actions, and targets; no bounding boxes or segmentation maps by default. Downstream systems may parse the text output for analytics. NVIDIA GPUs can significantly reduce inference time. <br>
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
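Because the model returns free-form text, downstream analytics usually need a light parsing step. The sketch below pulls `<instrument, verb, target>` triplets out of a generated description with a regular expression; the assumed output phrasing is an example, so real pipelines should adapt the pattern to the prompt they actually use.

```python
import re
from typing import List, Tuple

# Hypothetical parser: assumes the prompt asks the model to answer in the form
# "<instrument, verb, target>", e.g. "<grasper, retract, gallbladder>".
TRIPLET_PATTERN = re.compile(r"<\s*([^,<>]+)\s*,\s*([^,<>]+)\s*,\s*([^,<>]+)\s*>")

def parse_triplets(text: str) -> List[Tuple[str, ...]]:
    """Return all (instrument, verb, target) triplets found in the model output."""
    return [tuple(part.strip().lower() for part in match) for match in TRIPLET_PATTERN.findall(text)]

# Example:
print(parse_triplets("The frame shows <grasper, retract, gallbladder> and <hook, dissect, cystic_duct>."))
# [('grasper', 'retract', 'gallbladder'), ('hook', 'dissect', 'cystic_duct')]
```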
## Software Integration:
**Runtime Engine(s):** Any standard LLM-serving solution (e.g., PyTorch with Triton Inference Server) <br>
**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere (e.g., A100) <br>
* NVIDIA Hopper (e.g., H100) <br>
**Preferred/Supported Operating System(s):**
* Linux <br>
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
## Model Version(s):
v1.0 (Finetuned on CholecT50) <br>
This model may be used with the [MONAI Surgical Agent Framework](https://github.com/Project-MONAI/VLM-Surgical-Agent-Framework)
## Training Dataset:
[CholecT50](https://github.com/CAMMA-public/cholect50)
**Data Modality:** <br>
* Image and Text <br>
**Image Training Data Size:** <br>
* Less than a Million Images <br>
**Text Training Data Size:** <br>
* Less than a Billion Tokens <br>
**Data Collection Method by dataset:** <br>
* Hybrid: Automated, Human <br>
**Labeling Method by dataset:** <br>
* Human <br>
**Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~50 laparoscopic cholecystectomy procedures; frames extracted at 1 fps (~100K training frames); annotations include `<instrument, verb, target>` triplets. <br>
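For readers reproducing a similar fine-tune, the sketch below illustrates one plausible way to turn per-frame triplet annotations into chat-formatted training records. The JSON layout, field names, and prompt text are hypothetical; they are not the exact recipe used for this release, and the real CholecT50 labels ship in a different format.

```python
import json
from pathlib import Path

def build_samples(annotation_file: str, frame_dir: str):
    """Convert per-frame triplet annotations into chat-style training records.

    Assumes a hypothetical layout: {"frames": [{"file": ..., "triplets":
    [["grasper", "retract", "gallbladder"], ...]}, ...]}. Adapt the loading
    code to the actual CholecT50 annotation files.
    """
    data = json.loads(Path(annotation_file).read_text())
    samples = []
    for frame in data["frames"]:
        answer = "; ".join(f"<{i}, {v}, {t}>" for i, v, t in frame["triplets"])
        samples.append(
            {
                "image": str(Path(frame_dir) / frame["file"]),
                "conversations": [
                    {"role": "user", "content": "List the <instrument, verb, target> triplets in this frame."},
                    {"role": "assistant", "content": answer or "No triplet is visible."},
                ],
            }
        )
    return samples
```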
### Testing Dataset:
**Link:** CholecT50 (holdout portion) <br>
**Data Collection Method by dataset:** <br>
* Hybrid: Automated, Human <br>
**Labeling Method by dataset:** <br>
* Human <br>
**Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~1–2K frames for testing. <br>
### Evaluation Dataset:
**Link:** CholecT50 (dedicated set never seen during training) <br>
**Benchmark Score:** <br>
F1-score (Triplets): Instrument: 0.81, Verb: 0.64, Target (Anatomy): 0.60 <br>
**Data Collection Method by dataset:** <br>
* Hybrid: Automated, Human <br>
**Labeling Method by dataset:** <br>
* Human <br>
**Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~1–2K frames for final evaluation. <br>
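The component F1 scores above compare instrument, verb, and target terms parsed from the model's text output against reference labels. The snippet below is an illustrative scoring sketch using micro-averaged F1 from scikit-learn; the exact matching and averaging protocol behind the reported numbers is not specified here.

```python
from sklearn.metrics import f1_score

def component_f1(pred_triplets, gold_triplets, index):
    """Micro-averaged F1 for one triplet component (0=instrument, 1=verb, 2=target).

    `pred_triplets` and `gold_triplets` are parallel lists with one
    (instrument, verb, target) tuple per evaluated frame.
    """
    preds = [p[index] for p in pred_triplets]
    golds = [g[index] for g in gold_triplets]
    return f1_score(golds, preds, average="micro")

# Example with two frames:
gold = [("grasper", "retract", "gallbladder"), ("hook", "dissect", "cystic_duct")]
pred = [("grasper", "retract", "liver"), ("hook", "dissect", "cystic_duct")]
print(component_f1(pred, gold, 0))  # instrument F1
print(component_f1(pred, gold, 2))  # target F1
```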
# Inference:
**Acceleration Engine:** vLLM <br>
**Test Hardware:** A6000 <br>
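For accelerated inference, the sketch below runs the model through vLLM's offline `LLM` API with a single image input. It assumes a vLLM build with Qwen2.5-VL support; the chat markup and file path are illustrative and should be adapted to the served configuration.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Assumes a vLLM version with Qwen2.5-VL support; adjust max_model_len to fit GPU memory.
llm = LLM(model="nvidia/Qwen2.5-VL-7B-Surg-CholecT50", max_model_len=8192)
sampling = SamplingParams(temperature=0.0, max_tokens=128)

frame = Image.open("frame_00042.png").convert("RGB")  # illustrative path

# Qwen2.5-VL-style chat markup with one image placeholder (illustrative template).
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "List the instrument, verb, and target in this frame.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": frame}},
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```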
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>
Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the model will not blur or otherwise anonymize the subjects included.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). <br>