SUR-adapter

This repository is the implementation of "SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models" [paper][code].

Introduction

Diffusion models, which have emerged as popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, existing models have limited semantic understanding and commonsense reasoning when the input prompts are concise narratives, which results in low-quality image generation. To improve their handling of narrative prompts, we propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. To reach this goal, we first collect and annotate a new dataset, SURD, which consists of more than 57,000 semantically corrected multi-modal samples. Each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. We then align the semantic representation of narrative prompts to that of the complex prompts and transfer knowledge from large language models (LLMs) to the SUR-adapter via knowledge distillation, so that it acquires powerful semantic understanding and reasoning capabilities and builds high-quality textual semantic representations for text-to-image generation.
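
To make the idea above concrete, here is a minimal conceptual sketch (not the released training code) of the two objectives described: aligning narrative-prompt features to complex-prompt features and distilling LLM knowledge into the adapter. The adapter architecture, the feature width of 768, the MSE losses, and the pre-extracted LLM features are all illustrative assumptions, not details taken from the paper or this repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAdapter(nn.Module):
    # Hypothetical stand-in for the SUR-adapter: refines narrative-prompt text features.
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.net(x)  # residual refinement of the text features

def training_losses(adapter, simple_feat, complex_feat, llm_feat):
    # simple_feat / complex_feat: text-encoder features of the narrative and keyword prompts;
    # llm_feat: pre-extracted LLM representation of the narrative prompt (same width assumed).
    refined = adapter(simple_feat)
    align_loss = F.mse_loss(refined, complex_feat)  # pull narrative features toward complex-prompt features
    distill_loss = F.mse_loss(refined, llm_feat)    # transfer LLM knowledge into the adapter
    return align_loss + distill_loss

adapter = ToyAdapter()
simple_feat = torch.randn(4, 768)   # placeholder batch of narrative-prompt features
complex_feat = torch.randn(4, 768)  # placeholder complex-prompt features
llm_feat = torch.randn(4, 768)      # placeholder LLM features
loss = training_losses(adapter, simple_feat, complex_feat, llm_feat)
loss.backward()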

Usage

Clone the code and the pretrained SUR-adapter checkpoint.

git clone https://huggingface.co/zhongshsh/SUR-adapter

Then run demo.ipynb, which contains the following example.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # select the GPU to use

import torch
from SUR_adapter_pipeline import SURStableDiffusionPipeline
from SUR_adapter import Adapter

# Load the pretrained diffusion model and move it to the GPU.
model_path = "SG161222/Realistic_Vision_V2.0"
pipe = SURStableDiffusionPipeline.from_pretrained(model_path)
pipe.to("cuda")
pipe.safety_checker = lambda images, clip_input: (images, False)  # disable the safety checker

# Load the pretrained SUR-adapter checkpoint.
adapter = Adapter().to("cuda")
adapter.load_state_dict(torch.load("adapter_checkpoint.pt"))

# Generate an image from a simple narrative prompt with the adapter.
image = pipe(prompt='A beautiful cat', adapter=adapter).images[0]
image.show()
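
If you run outside a notebook, you can save the result with PIL's image.save instead of displaying it. The variation below is a sketch that assumes SURStableDiffusionPipeline keeps the standard StableDiffusionPipeline generation arguments (negative_prompt, guidance_scale, num_inference_steps); the prompt and sampler settings shown are illustrative, not part of the original demo.

image = pipe(
    prompt='A beautiful cat',
    negative_prompt='blurry, low quality',  # illustrative negative prompt
    guidance_scale=7.5,
    num_inference_steps=50,
    adapter=adapter,
).images[0]
image.save("cat.png")  # save to disk instead of displaying inline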

Citation

@inproceedings{zhong2023adapter,
  title={Sur-adapter: Enhancing text-to-image pre-trained diffusion models with large language models},
  author={Zhong, Shanshan and Huang, Zhongzhan and Wen, Wushao and Qin, Jinghui and Lin, Liang},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={567--578},
  year={2023}
}

Acknowledgments

Many thanks to Eugene for SG161222/Realistic_Vision_V2.0, which we use as the pretrained image generation model to be fine-tuned.
