|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: cc-by-4.0 |
|
|
tags: |
|
|
- vision |
|
|
- image-text-to-text |
|
|
- medical |
|
|
- dermatology |
|
|
- multimodal |
|
|
- clip |
|
|
- zero-shot-classification |
|
|
- image-classification |
|
|
pipeline_tag: zero-shot-image-classification |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# DermLIP: Dermatology Language-Image Pretraining |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**DermLIP** is a vision-language model for dermatology, trained on the **Derm1M** dataset, the largest dermatological image-text corpus to date. This variant (`PanDerm-base-w-PubMed-256`) builds on domain-specific pretrained encoders and outperforms the other DermLIP variants.
|
|
|
|
|
### Model Details |
|
|
|
|
|
- **Model Type:** Pretrained Vision-Language Model (CLIP-style) |
|
|
|
|
|
- **Architecture:** |
|
|
|
|
|
- **Vision encoder (PanDerm-base)**: https://github.com/SiyuanYan1/PanDerm |
|
|
- **Text encoder (PubmedBert-256)**: https://huggingface.co/NeuML/pubmedbert-base-embeddings |
|
|
|
|
|
- **Resolution:** 224×224 pixels (see the preprocessing check after this list)
|
|
|
|
|
- **Paper:** https://arxiv.org/abs/2503.14911 |
|
|
|
|
|
- **Repository:** https://github.com/SiyuanYan1/Derm1M |
|
|
|
|
|
- **License:** cc-by-nc-nd-4.0
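
A quick way to confirm the 224×224 input resolution is to print the validation transform that `open_clip` returns alongside the model. This is a minimal sketch using the same checkpoint ID as the Quick Start below; the exact transform composition may differ slightly between `open_clip` versions.

```python
import open_clip

# Load the model together with its preprocessing transforms (train transform discarded)
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256'
)

# The printed torchvision Compose shows the expected input resolution (224x224)
print(preprocess)
```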
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Training data:** 403,563 skin image-text pairs from the Derm1M dataset, covering both dermoscopic and clinical images.
|
|
- **Training objective:** image-text contrastive loss (a minimal sketch follows this list)
|
|
- **Hardware:** 1 × NVIDIA H200 (~90 GB memory usage)
|
|
- **Hours used:** ~9.5 hours |
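
For reference, the contrastive objective treats each image-caption pair in a batch as a positive and every other caption as a negative, optimizing a symmetric cross-entropy over the image-text similarity matrix. The sketch below is illustrative only; it is not the Derm1M training code, and the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_features, text_features, logit_scale):
    # L2-normalize so the dot products below are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity between every image and every caption in the batch
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # The matching caption for image i is at index i (the diagonal)
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions
    return (F.cross_entropy(logits_per_image, labels)
            + F.cross_entropy(logits_per_text, labels)) / 2
```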
|
|
|
|
|
## Intended Uses |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
- Zero-shot classification |
|
|
- Few-shot learning (see the linear-probe sketch after this list)
|
|
- Cross-modal retrieval (see the retrieval sketch after the Quick Start)
|
|
- Concept annotation/explanation |
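
Few-shot adaptation can be done by fitting a lightweight classifier on frozen DermLIP image features. The sketch below assumes `model` and `preprocess` are loaded as in the Quick Start, requires scikit-learn, and uses placeholder image paths and labels.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from sklearn.linear_model import LogisticRegression

# Hypothetical few-shot support set: a handful of labeled example images
support_paths = ["nevus_example.png", "melanoma_example.png"]
support_labels = [0, 1]  # 0 = nevus, 1 = melanoma

def embed(paths):
    # Encode images with the frozen DermLIP vision encoder
    batch = torch.stack([preprocess(Image.open(p)) for p in paths])
    with torch.no_grad():
        features = model.encode_image(batch)
    return F.normalize(features, dim=-1).cpu().numpy()

# Fit a linear probe on the frozen features, then classify a new image
probe = LogisticRegression(max_iter=1000).fit(embed(support_paths), support_labels)
print(probe.predict(embed(["query_image.png"])))  # hypothetical query image
```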
|
|
|
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Installation |
|
|
|
|
|
First, clone the Derm1M repository: |
|
|
```bash
git clone https://github.com/SiyuanYan1/Derm1M.git
cd Derm1M
```
|
|
|
|
|
Then install the package following the instructions in the repository.
|
|
|
|
|
### Quick Start |
|
|
```python
import open_clip
from PIL import Image
import torch

# Load the model and preprocessing transforms from the Hugging Face Hub checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256'
)
model.eval()

# Initialize the tokenizer
tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256')

# Read and preprocess an example image
image = preprocess(Image.open("your_skin_image.png")).unsqueeze(0)

# Define disease labels (example: PAD dataset classes)
PAD_CLASSNAMES = [
    "nevus",
    "basal cell carcinoma",
    "actinic keratosis",
    "seborrheic keratosis",
    "squamous cell carcinoma",
    "melanoma",
]

# Build one text prompt per class
template = lambda c: f'This is a skin image of {c}'
text = tokenizer([template(c) for c in PAD_CLASSNAMES])

# Inference (autocast applies only when tensors are on a CUDA device)
with torch.no_grad(), torch.autocast("cuda"):
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute class probabilities from image-text similarity
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Get the prediction
final_prediction = PAD_CLASSNAMES[torch.argmax(text_probs[0])]
print(f'This image is diagnosed as {final_prediction}.')
print("Label probabilities:", text_probs)
```
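
Continuing from the Quick Start, the same encoders support cross-modal retrieval, for example ranking candidate text descriptions against a query image. The minimal sketch below reuses `model`, `tokenizer`, and the normalized `image_features` from above; the candidate captions are illustrative.

```python
# Candidate descriptions to rank against the query image (illustrative examples)
captions = [
    "a pigmented lesion with irregular borders",
    "an erythematous scaly plaque",
    "a pearly nodule with visible telangiectasia",
]

with torch.no_grad():
    # Encode and normalize the candidate captions
    caption_features = model.encode_text(tokenizer(captions))
    caption_features /= caption_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between the query image and each caption
    similarity = (image_features @ caption_features.T).squeeze(0)

# Print captions from most to least similar
for idx in similarity.argsort(descending=True).tolist():
    print(f"{similarity[idx].item():.3f}  {captions[idx]}")
```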
|
|
|
|
|
## Contact |
|
|
|
|
|
For any additional questions or comments, contact Siyuan Yan (`siyuan.yan@monash.edu`).
|
|
|
|
|
## Cite our Paper |
|
|
```bibtex
@misc{yan2025derm1m,
  title         = {Derm1M: A Million-Scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology},
  author        = {Siyuan Yan and Ming Hu and Yiwen Jiang and Xieji Li and Hao Fei and Philipp Tschandl and Harald Kittler and Zongyuan Ge},
  year          = {2025},
  eprint        = {2503.14911},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2503.14911}
}

@article{yan2025multimodal,
  title     = {A multimodal vision foundation model for clinical dermatology},
  author    = {Yan, Siyuan and Yu, Zhen and Primiero, Clare and Vico-Alonso, Cristina and Wang, Zhonghua and Yang, Litao and Tschandl, Philipp and Hu, Ming and Ju, Lie and Tan, Gin and others},
  journal   = {Nature Medicine},
  pages     = {1--12},
  year      = {2025},
  publisher = {Nature Publishing Group}
}
```