---
license: cc-by-nc-4.0
base_model:
- stabilityai/stable-diffusion-3-medium-diffusers
pipeline_tag: image-to-image
tags:
- image-generation
- image-to-image
- virtual-try-on
- virtual-try-off
- diffusion
- dit
- stable-diffusion-3
- multimodal
- fashion
- pytorch
language: en
datasets:
- dresscode
- viton-hd
---
# TEMU-VTOFF: Text-Enhanced MUlti-category Virtual Try-Off

**Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals**
[Davide Lobba](https://scholar.google.com/citations?user=WEMoLPEAAAAJ&hl=en&oi=ao)<sup>1,2,\*</sup>, [Fulvio Sanguigni](https://scholar.google.com/citations?user=tSpzMUEAAAAJ&hl=en)<sup>2,3,\*</sup>, [Bin Ren](https://scholar.google.com/citations?user=Md9maLYAAAAJ&hl=en)<sup>1,2</sup>, [Marcella Cornia](https://scholar.google.com/citations?user=DzgmSJEAAAAJ&hl=en)<sup>3</sup>, [Rita Cucchiara](https://scholar.google.com/citations?user=OM3sZEoAAAAJ&hl=en)<sup>3</sup>, [Nicu Sebe](https://scholar.google.com/citations?user=stFCYOAAAAAJ&hl=en)<sup>1</sup>

<sup>1</sup>University of Trento, <sup>2</sup>University of Pisa, <sup>3</sup>University of Modena and Reggio Emilia

\* Equal contribution
## 💡 Model Description
**TEMU-VTOFF** is a novel dual-DiT (Diffusion Transformer) architecture designed for the Virtual Try-Off task: generating in-shop product images of garments from photos of clothed individuals. By combining a pretrained feature extractor with a text-enhanced generation module, our method can handle occlusions, multiple garment categories, and ambiguous appearances. It further refines generation fidelity via a feature alignment module based on DINOv2.
This model is based on `stabilityai/stable-diffusion-3-medium-diffusers`. The uploaded weights correspond to the fine-tuned feature extractor and the VTOFF DiT module.
## ✨ Key Features
Our contribution can be summarized as follows:
- **🎯 Multi-Category Try-Off**. We present a unified framework capable of handling multiple garment types (upper-body, lower-body, and full-body garments) without requiring category-specific pipelines.
- **🔗 Multimodal Hybrid Attention**. We introduce a novel attention mechanism that integrates garment textual descriptions into the generative process by linking them with person-specific features. This helps the model synthesize occluded or ambiguous garment regions more accurately.
- **⚡ Garment Aligner Module**. We design a lightweight aligner that conditions generation on clean garment images, replacing conventional denoising objectives. This improves alignment consistency across the dataset and better preserves fine-grained visual details.
- **📊 Extensive experiments**. Experiments on the Dress Code and VITON-HD datasets demonstrate that TEMU-VTOFF outperforms prior methods in both the quality of generated images and alignment with the target garment, highlighting its strong generalization capabilities.