---
license: apache-2.0
datasets:
- VisualCloze/Graph200K
pipeline_tag: image-to-image
library_name: diffusers
base_model:
- black-forest-labs/FLUX.1-Fill-dev
tags:
- text-to-image
- image-to-image
- flux
- lora
- in-context-learning
- universal-image-generation
- ai-tools
- VisualCloze
- VisualClozePipeline
---
# VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning (LoRA weights for <strong><span style="color:red">Diffusers</span></strong>)
<div align="center">
[[Paper](https://arxiv.org/abs/2504.07960)]   [[Project Page](https://visualcloze.github.io/)]   [[Github](https://github.com/lzyhha/VisualCloze)]
</div>
<div align="center">
[[🤗 <strong><span style="color:hotpink">Diffusers</span></strong> Implementation](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze)]
</div>
<div align="center">
[[🤗 Online Demo](https://huggingface.co/spaces/VisualCloze/VisualCloze)]   [[🤗 Full Model Card](https://huggingface.co/VisualCloze/VisualClozePipeline-384)]   [[🤗 Dataset Card](https://huggingface.co/datasets/VisualCloze/Graph200K)]
</div>

If you find VisualCloze helpful, please consider starring ⭐ the [<strong><span style="color:hotpink">Github Repo</span></strong>](https://github.com/lzyhha/VisualCloze). Thanks!
## 📰 News
- [2025-5-15] 🤗🤗🤗 VisualCloze has been merged into the [<strong><span style="color:hotpink">official pipelines of diffusers</span></strong>](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze).
## 🌟 Key Features
VisualCloze is an in-context-learning-based universal image generation framework that:
1. Supports various in-domain tasks.
2. Generalizes to <strong><span style="color:hotpink">unseen tasks</span></strong> through in-context learning.
3. Unifies multiple tasks into one step, generating both the target image and intermediate results.
4. Supports reverse-engineering a set of conditions from a target image.

🔥 Examples are shown in the [project page](https://visualcloze.github.io/).
## 🔧 Installation
<strong><span style="color:hotpink">Install the official</span></strong> [diffusers](https://github.com/huggingface/diffusers.git) from source:
```bash
pip install git+https://github.com/huggingface/diffusers.git
```
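To confirm that the installed build exposes the pipeline used below, a quick import check is enough:
```bash
python -c "from diffusers import VisualClozePipeline; print('VisualClozePipeline is available')"
```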
### 💻 Diffusers Usage
This model card contains the LoRA weights for use with diffusers, while the
[Full Model Card](https://huggingface.co/VisualCloze/VisualClozePipeline-384) provides the full weights.
This model is trained with a `resolution` of 384; models trained with a `resolution` of 512 are
released at [Full Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-512) and [LoRA Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512).
The `resolution` is the size each image is resized to before being concatenated, which avoids
out-of-memory errors. To generate high-resolution images, SDEdit is used to upsample the generated results.
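For reference, a minimal sketch of loading the 512-resolution LoRA variant instead; the `weight_name` below is assumed by analogy with the 384 release, so verify the actual filename on the 512 LoRA model card:
```python
import torch
from diffusers import VisualClozePipeline

# Sketch: load the 512-resolution variant instead of the 384 one.
pipe = VisualClozePipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", resolution=512, torch_dtype=torch.bfloat16
)
pipe.load_lora_weights(
    "VisualCloze/VisualClozePipeline-LoRA-512",
    weight_name="visualcloze-lora-512.safetensors",  # assumed filename; check the 512 model card
)
```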
#### Example with Depth-to-Image:
<img src="./visualcloze_diffusers_example_depthtoimage.jpg" width="60%" height="50%" alt="Example with Depth-to-Image"/>
```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image
# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2_depth-anything-v2_Large.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/79f2ee632f1be3ad64210a641c4e201b/79f2ee632f1be3ad64210a641c4e201b_depth-anything-v2_Large.jpg'),
        None,  # No image needed for the query in this case
    ],
]
# Task and content prompt
task_prompt = "Each row outlines a logical process, starting from [IMAGE1] gray-based depth map with detailed object contours, to achieve [IMAGE2] an image with flawless clarity."
content_prompt = """A serene portrait of a young woman with long dark hair, wearing a beige dress with intricate
gold embroidery, standing in a softly lit room. She holds a large bouquet of pale pink roses in a black box,
positioned in the center of the frame. The background features a tall green plant to the left and a framed artwork
on the wall to the right. A window on the left allows natural light to gently illuminate the scene.
The woman gazes down at the bouquet with a calm expression. Soft natural lighting, warm color palette,
high contrast, photorealistic, intimate, elegant, visually balanced, serene atmosphere."""
# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=384, torch_dtype=torch.bfloat16)
pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-384', weight_name='visualcloze-lora-384.safetensors')
pipe.to("cuda")
# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1024,
    upsampling_height=1024,
    upsampling_strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0][0]
# Save the resulting image
image_result.save("visualcloze.png")
```
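If the full pipeline does not fit in GPU memory, the standard diffusers offloading helper can be used in place of moving the whole pipeline to CUDA; this is generic diffusers usage, not specific to VisualCloze:
```python
# Optional: trade speed for lower peak GPU memory (standard diffusers API).
# Call this instead of pipe.to("cuda"); components move to the GPU on demand.
pipe.enable_model_cpu_offload()
```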
#### Example with Virtual Try-On:
<img src="./visualcloze_diffusers_example_tryon.jpg" width="60%" height="50%" alt="Example with Virtual Try-On"/>
```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image
# Load in-context images (make sure the paths are correct and accessible)
# The images are from the VITON-HD dataset at https://github.com/shadow2496/VITON-HD
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/03673_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00_tryon_catvton_0.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00555_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/12265_00.jpg'),
        None,  # No image needed for the query in this case
    ],
]
# Task and content prompt
task_prompt = "Each row shows a virtual try-on process that aims to put [IMAGE2] the clothing onto [IMAGE1] the person, producing [IMAGE3] the person wearing the new clothing."
content_prompt = None
# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=384, torch_dtype=torch.bfloat16)
pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-384', weight_name='visualcloze-lora-384.safetensors')
pipe.to("cuda")
# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_height=1632,
    upsampling_width=1232,
    upsampling_strength=0.3,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0][0]
# Save the resulting image
image_result.save("visualcloze.png")
```
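Both examples above keep only `.images[0][0]`. Since the pipeline returns a nested list of images (one list per query row), iterating over it saves every generated image, including intermediate results when a task unifies multiple steps; a minimal sketch, reusing the prompts and `image_paths` defined above:
```python
# Sketch: keep every generated image instead of only the first.
# result.images[i][j] is the j-th image produced for the i-th query row.
result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
)
for i, row in enumerate(result.images):
    for j, img in enumerate(row):
        img.save(f"visualcloze_{i}_{j}.png")
```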
### Citation
If you find VisualCloze useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{li2025visualcloze,
  title={VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning},
  author={Li, Zhong-Yu and Du, Ruoyi and Yan, Juncheng and Zhuo, Le and Li, Zhen and Gao, Peng and Ma, Zhanyu and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2504.07960},
  year={2025}
}
```
### Acknowledgement
The LoRA weights were converted with the help of [NagaSaiAbhinay](https://huggingface.co/NagaSaiAbhinay) in this [discussion](https://huggingface.co/VisualCloze/VisualClozePipeline-384/discussions/1).