---
language:
- en
license: cc-by-4.0
tags:
- vision
- image-text-to-text
- medical
- dermatology
- multimodal
- clip
- zero-shot-classification
- image-classification
pipeline_tag: zero-shot-image-classification
library_name: transformers
---

# DermLIP: Dermatology Language-Image Pretraining

## Model Description

**DermLIP** is a vision-language model for dermatology, trained on the **Derm1M** dataset, the largest dermatological image-text corpus to date. This model variant (`PanDerm-base-w-PubMed-256`) uses domain-specific pretraining to deliver superior performance compared to the other DermLIP variants.

### Model Details

- **Model Type:** Pretrained vision-language model (CLIP-style)
- **Architecture:**
  - **Vision encoder (PanDerm-base):** https://github.com/SiyuanYan1/PanDerm
  - **Text encoder (PubMedBERT-256):** https://huggingface.co/NeuML/pubmedbert-base-embeddings
- **Resolution:** 224×224 pixels
- **Paper:** https://arxiv.org/abs/2503.14911
- **Repository:** https://github.com/SiyuanYan1/Derm1M
- **License:** cc-by-nc-nd-4.0

## Training Details

- **Training data:** 403,563 skin image-text pairs from the Derm1M dataset, including both dermoscopic and clinical images.
- **Training objective:** Image-text contrastive loss
- **Hardware:** 1× NVIDIA H200 (~90 GB memory usage)
- **Hours used:** ~9.5 hours

## Intended Uses

### Primary Use Cases

- Zero-shot classification
- Few-shot learning
- Cross-modal retrieval (see the example sketch after the Quick Start below)
- Concept annotation/explanation

## How to Use

### Installation

First, clone the Derm1M repository:

```bash
git clone git@github.com:SiyuanYan1/Derm1M.git
cd Derm1M
```

Then install the package following the instructions in the repository.

### Quick Start

```python
import open_clip
from PIL import Image
import torch

# Load the model from the Hugging Face checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256'
)
model.eval()

# Initialize the tokenizer
tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256')

# Read an example image
image = preprocess(Image.open("your_skin_image.png")).unsqueeze(0)

# Define disease labels (example: PAD dataset classes)
PAD_CLASSNAMES = [
    "nevus", "basal cell carcinoma", "actinic keratosis",
    "seborrheic keratosis", "squamous cell carcinoma", "melanoma"
]

# Build text prompts
template = lambda c: f'This is a skin image of {c}'
text = tokenizer([template(c) for c in PAD_CLASSNAMES])

# Inference
with torch.no_grad(), torch.autocast("cuda"):
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute similarity
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Get the prediction
final_prediction = PAD_CLASSNAMES[torch.argmax(text_probs[0])]
print(f'This image is diagnosed as {final_prediction}.')
print("Label probabilities:", text_probs)
```
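### Cross-Modal Retrieval

The same encoders can also rank a gallery of images against a free-text query. The sketch below is a minimal, unofficial illustration of text-to-image retrieval: it reloads the checkpoint exactly as in the Quick Start, and the image file names (`case_001.png`, ...) are hypothetical placeholders for your own data.

```python
import open_clip
import torch
from PIL import Image

# Load the same checkpoint as in the Quick Start
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256'
)
tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256')
model.eval()

# Hypothetical image files; replace with your own gallery
image_paths = ["case_001.png", "case_002.png", "case_003.png"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

# A free-text query, using the same prompt style as the Quick Start
query = tokenizer(["This is a skin image of melanoma"])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)

    # L2-normalize so dot products are cosine similarities
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Similarity of every image to the query, highest first
    scores = (image_features @ text_features.T).squeeze(-1)
    ranking = scores.argsort(descending=True)

for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. {image_paths[idx]} (score: {scores[idx]:.3f})")
```

The same normalized features work in the other direction (image-to-text retrieval) by swapping the roles of the query and the gallery.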
## Contact

For any additional questions or comments, contact Siyuan Yan (`siyuan.yan@monash.edu`).

## Cite our Paper

```bibtex
@misc{yan2025derm1m,
  title         = {Derm1M: A Million-Scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology},
  author        = {Siyuan Yan and Ming Hu and Yiwen Jiang and Xieji Li and Hao Fei and Philipp Tschandl and Harald Kittler and Zongyuan Ge},
  year          = {2025},
  eprint        = {2503.14911},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2503.14911}
}

@article{yan2025multimodal,
  title     = {A multimodal vision foundation model for clinical dermatology},
  author    = {Yan, Siyuan and Yu, Zhen and Primiero, Clare and Vico-Alonso, Cristina and Wang, Zhonghua and Yang, Litao and Tschandl, Philipp and Hu, Ming and Ju, Lie and Tan, Gin and others},
  journal   = {Nature Medicine},
  pages     = {1--12},
  year      = {2025},
  publisher = {Nature Publishing Group}
}
```